Old Dominion University Research Foundation

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING 
COLLEGE OF ENGINEERING & TECHNOLOGY 
OLD DOMINION UNIVERSITY 
NORFOLK, VIRGINIA 23529 




STRATEGIES FOR CONCURRENT PROCESSING OF 
COMPLEX ALGORITHMS IN DATA DRIVEN ARCHITECTURES


By 

John W. Stoughton, Principal Investigator
Roland R. Mielke, Co-Principal Investigator 
Sukhamoy Som, Graduate Research Assistant 
Rodrigo Obando, Graduate Research Assistant 
Robert Tymchyshyn, Graduate Research Assistant 

Progress Report 

For the period May 16, 1987 to May 15, 1988 


Prepared for the 

National Aeronautics and Space Administration 
Langley Research Center 
Hampton, VA 23665 




Under 

Research Grant NAG-1-683

Mr. Paul J. Hayes, Technical Monitor 
ISD-Information Processing Technology Branch 


(NASA-CR-161329) STRATEGIES FOR CONCURRENT
PROCESSING OF COMPLEX ALGORITHMS IN DATA
DRIVEN ARCHITECTURES Progress Report, 16 May
1987 - 15 May 1988 (Old Dominion Univ.)
CSCL 09B G3/61


N89-1 14C6


Unclas


June 1988 




Submitted by the 

Old Dominion University Research Foundation 
P. O. Box 6369
Norfolk, Virginia 23508 


June 1988 




DISCLAIMER

The use of brand names in this document is for completeness only and
does not imply NASA endorsement.




STRATEGIES FOR CONCURRENT PROCESSING OF COMPLEX
ALGORITHMS IN DATA DRIVEN ARCHITECTURES 

By 

John W. Stoughton, Roland R. Mielke, Sukhamoy Som,

Rodrigo Obando and Robert Tymchyshyn

ABSTRACT 

The purpose of this report is to document research to develop stra- 
tegies for concurrent processing of complex algorithms in data driven archi- 
tectures. The problem domain consists of decision-free algorithms having 
large-grained, computationally complex primitive operations. Such algorithms
are often found in signal processing and control applications. The anticipated
multiprocessor environment is a data flow architecture containing between two
and twenty computing elements. Each computing element is a processor which has
local program memory and which communicates with a common global data memory.
A new graph theoretic model called ATAMM, which establishes rules for relating
a decomposed algorithm to its execution in a data flow architecture, is
presented. The ATAMM model is used to determine strategies to achieve optimum
time performance and to develop a system diagnostic software tool. In
addition, preliminary work on a new multiprocessor operating system based on
the ATAMM specifications is described.

I. INTRODUCTION


The purpose of this report is to document research to develop strate- 
gies for concurrent processing of complex algorithms in data driven archi- 
tectures. The problem domain consists of decision-free algorithms having 
large-grained, computationally complex primitive operations. The antici- 
pated multiprocessor environment is assumed to contain between two and 
twenty computing elements for concurrent execution of the various primitive 
operations. Each computing element or functional unit is a processor having 
local memory for program storage and temporary input and output data con- 
tainers. The functional units have a common global data memory, and func- 
tional unit activity is coordinated by a graph manager. The global memory 
and graph manager may be either centralized or distributed. The authors 
have proposed a new graph theoretic model to provide a basis for establishing
rules for relating a decomposed algorithm to its execution in a data
flow environment. The model is identified by the acronym ATAMM which
represents Algorithm To Architecture Mapping Model. The availability of the
ATAMM model is important because it provides a context in which to investigate
algorithm decomposition strategies, it provides a basis for predicting
and improving time performance, and it identifies the data flow and control 
flow required of any data flow architecture which implements the algorithm. 

During an earlier grant period, May 16, 1986 to May 15, 1987, the au- 
thors formulated the ATAMM model for representing the implementation of a 
decomposed algorithm in a data flow architecture. In addition, a simulation 
tool was developed to display data flow and control flow for algorithms 
operating according to the ATAMM rules. During the present grant period,
May 16, 1987 to May 15, 1988, the ATAMM model was used to analytically
determine performance bounds for task computational time and system throughput
time. An operating strategy which achieves optimum time performance was
developed. In addition, a new diagnostic software tool was developed for 
use with the simulation tool. The diagnostic tool monitors detailed system 
operation and displays global system performance indicators and measures. 
Also, a new multiprocessor operating system based on the ATAMM specifica- 
tions is being constructed to validate the ATAMM rules and to provide a 
testbed for further experimentation. It is the purpose of this report to
provide a detailed description of the research performed during the present
grant period.

In Section II, an overview of research performed during the period May
16, 1987 to May 15, 1988 is presented. This overview consists of summaries 
of work to develop strategies for optimum time performance, diagnostic soft- 
ware tools, and a testbed operating system. In Section III, the development 
of strategies for optimum time performance is described. The new diagnostic
software tools are explained and illustrated in Section IV. Recommendations
for continuing and future research are briefly outlined in Section V.
Two papers describing recent research efforts are included as appendices.

II. RESEARCH OVERVIEW


In this section, a summary of research activity conducted during the 
period May 16, 1987 through May 15, 1988 is presented. A more detailed 
description of this work, as well as illustrative examples, is given in the 
following sections and the appendices. 

II.1 Modeling and Performance

The development of a new graph theoretic model for describing data and 
control flow associated with the execution of large-grained algorithms in a 
special distributed computing environment is presented. The model is iden- 
tified by the acronym ATAMM which represents Algorithm To Architecture 
Mapping Model. The purpose of such a model is to provide a basis for 
establishing rules for relating an algorithm to its execution in a multi- 
processor environment. Specifications derived from the model lead directly 
to the description of a data flow architecture. The availability of the 
ATAMM model is important for at least three reasons. First, it provides a 
context in which to investigate algorithm decomposition strategies without 
the need to specify a specific computer architecture. Second, the model 
identifies the data flow and control dialog required of any data flow archi- 
tecture which implements the algorithm. Third, the model provides a basis 
for analytically calculating performance bounds for computing speed and
throughput capacity.

The problem domain of the ATAMM model consists of decision free algo- 
rithms with computationally complex primitive operations which are assumed 
to be implemented in a dedicated data flow environment. The algorithms are 
such as may be found in (but not limited to) large scale signal processing 
and control applications. The anticipated multiprocessor environment is 
assumed to consist of two to twenty processing elements for concurrent
execution of the various algorithm primitives. 

The development of new computer architectures based upon distributed, 
multiprocessor organizations [1], [2] is motivated mainly by the requirement 
for increased speed and greater throughput capability in complex signal 
processing applications [3]. Recent advances in the production of
high-density microelectronics [4] have made possible the construction of parallel
architectures consisting of identical, special purpose computing elements 
[5]. A number of models for describing the behavior of algorithms in this 
setting have been developed [6] - [8]. However, these models represent only 
the data flow and do not adequately display the complex issues of communi- 
cation and control flow which must occur in any realization of the model. 

For this reason, it has been difficult to investigate how to effectively 
match the decomposition and scheduling of algorithms to the structure and 
control of parallel architectures. The importance of better understanding 
the relationship between algorithms and architectures is only now becoming 
recognized [9]. 

A new model useful for understanding the relationship between decom- 
posed algorithms and data flow architectures has been presented. Named 
ATAMM for Algorithm To Architecture Mapping Model, the model consists of 
Petri net marked graphs called the algorithm marked graph, the node marked 
graph, and the computational marked graph. After establishing that the 
computational marked graph is live, safe and consistent, graph time
performance measures of time between input and output (TBIO), task time (TT),
and time between outputs (TBO) are defined. Then lower bounds for the
performance measures are calculated analytically from the modified algorithm
graph and the computational marked graph. A design strategy for achieving
optimum time performance is proposed and illustrated with a design example.

II.2 Diagnostic Tool Development

Although the ATAMM model is not complicated in principle, the execution
of a system modelled with it becomes difficult to follow as the number of
nodes and the number of resources increase. Therefore, it is necessary to
have diagnostic tools to explore the execution of a given algorithm. One of
the important parameters to observe is concurrency. Concurrency is a measure
of the number of resources that work at the same time for a specified length
of execution of an algorithm. Other parameters include TBIO (Time Between
Input and Output), TBO (Time Between Outputs), and TBI (Time Between Inputs).
These parameters refer to the time performance of the system: the time
elapsed between the reading of input data and the writing of its
corresponding output data (TBIO), the time elapsed between successive output
writings (TBO), and the time elapsed between successive input data readings
(TBI). Other necessary measurements are the time the system takes and the
different states it goes through to reach steady state.
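
As an illustration of how these parameters might be computed from recorded
event times, a minimal Python sketch follows. The function names, the form of
the input lists, and the reading of concurrency as the average number of busy
resources over a time window are assumptions made for this example; they are
not taken from the Analyzer program.

```python
def time_measures(read_times, write_times):
    """Per-packet TBIO plus successive TBO and TBI values.

    read_times[p] and write_times[p] are the (assumed) instants at which
    data packet p is read at the input and written at the output.
    """
    tbio = [w - r for r, w in zip(read_times, write_times)]
    tbo = [b - a for a, b in zip(write_times, write_times[1:])]
    tbi = [b - a for a, b in zip(read_times, read_times[1:])]
    return tbio, tbo, tbi


def average_concurrency(busy_intervals, t_start, t_end):
    """Average number of resources busy during [t_start, t_end].

    busy_intervals holds one (start, end) pair for every span during which
    some resource is working. The overlap of each span with the window is
    summed and divided by the window length (an assumed definition).
    """
    busy_time = 0
    for start, end in busy_intervals:
        overlap = min(end, t_end) - max(start, t_start)
        if overlap > 0:
            busy_time += overlap
    return busy_time / (t_end - t_start)


# Three packets read every 4 time units, each written 6 units after its read.
tbio, tbo, tbi = time_measures([0, 4, 8], [6, 10, 14])
# Three busy spans observed during the window [0, 6].
concurrency = average_concurrency([(0, 3), (1, 5), (4, 6)], 0, 6)
```

In this small case every packet has the same TBIO and the TBO equals the TBI,
which is what would be expected once a system has reached steady state.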

The Analyzer, a computer program, provides measurement of the items 
denoted above. The input to the program is a file containing a sequential 
account of the execution of a concurrent system. It displays the activity 
of the individual nodes of a graph. This display is drawn on a common time 
axis for easy reading of the concurrent execution of nodes. An alternate 
display is the plotting of the activity of the resources versus time. The 
program also displays concurrency as a function of time, a plot called the
Total Resource Utilization Envelope. For individual data packets,
the program displays the values of TBIO, TBO and TBI. It also reports
general statistics of the transitions per node. This program is primarily
intended for detailed post-execution analysis of the execution of an
algorithm.

Another computer program, the Graph Simulation/Analyzer, provides not 
only simulation of the execution of an algorithm but also analysis of data 
immediately after execution. It generates the sequential files containing 
firing of transitions in the CMG (Computational Marked Graph) to be analyzed 
by the Analyzer, the program described above. It also generates files with 
average values of TBO, TBI and TBIO. The simulation module has been 
improved so that it may include random variables as the values of the tran- 
sitions in the CMG. It accepts as input an ASCII file containing a descrip- 
tion of the topology of a graph, transition time assignments, priority 
assignment, initial marking, number of resources, etc. 

II.3 Testbed Development

A multiprocessor operating system has been developed based on the ATAMM 
specifications. It is the third prototype system to have been built in the 
past two years. The motivation for this is to give further credibility to
ATAMM through system validation and to provide a testbed for experimentation.
This discussion is divided into three design phases. In the system parti- 
tioning the ATAMM model is divided into logical components. Combined, these 
logical components must fully represent the ATAMM description. The next 
phase is the hardware mapping in which the logical components are mapped 
into a target architecture. Necessary inter-module communications and 
control dialogue paths must also be specified. The multiprocessor operating 
system implementation is the final design phase and will be referred to 
briefly. 
Three logical components have been isolated in the ATAMM partition: the
Graph Manager (GM), Functional Unit (FUN), and Global Memory (GLM). The
Graph Manager is responsible for implementing the state transitions of the
processes. It must monitor all token movement within the CMG required to
determine the fireability of a process. When a process can fire, the Graph
Manager must assign the first available Functional Unit to that process.
The Functional Unit will then execute all three NMG transitions for that
particular process. It must also, via interrupt, update all important token
movement within the NMG to the Graph Manager. As a Functional Unit can be
assigned to any process, it must also have the code available for the
computation of every process in the AMG. The Global Memory is the final logical
component in the partition and is responsible for storing data associated
with all Output Full edges in the CMG. Because of this, it must have a
communications path to all Functional Units for both the reading and writing
of data.

The three prototype multiprocessor operating systems previously
mentioned have all had different hardware mappings. Each new mapping was
guided through observations made in the development of the previous mapping.
In the current mapping all three logical components are distributed within
each hardware module. The hardware modules chosen are IBM PC/ATs, and are
connected on an Ethernet Local Area Network. This mapping presents two
advantages over the previous two in which the logical components were not
completely distributed. First, the redundancy of all logical components
provides a greater degree of fault tolerance. Secondly, a reduction of
inter-module communications, the major bottleneck in multiprocessor design,
is expected as the logical components all reside in the same hardware
module.

The final step in the design process is to develop a multiprocessor
operating system to implement the logical components as designated by the
hardware mapping. In addition to the hardware modules, a Sink/Source node
module was designed for the system initialization and monitoring. It is 
also responsible for injecting input data into the system and for receiving 
output data. The resulting multiprocessor has been successfully developed 
and is currently undergoing tests for ATAMM validation. Initial results are 
positive and all tests should be completed by the end of August. 

III. OPTIMUM TIME PERFORMANCE


III.1 Introduction

The development of a new graph theoretic model for describing the 
relation between a decomposed algorithm and its execution in a data flow 
environment is presented. Performance measures of computing speed and 
throughput capacity are defined. Lower bounds for these performance 
measures are established. In Subsection III.2 of this report, the modeling
process to describe algorithms in data flow architectures, ATAMM, is
presented. The model consists of three Petri net marked graphs called the
algorithm marked graph (AMG), the node marked graph (NMG), and the
computational marked graph (CMG). In Subsection III.3, the operating
characteristics of these graphs are investigated. A state variable
description is presented and used to establish the graph properties of
reachability, liveness and safeness. Time performance measures for concurrent
processing are defined in Subsection III.4. The ATAMM model is used as the
basis for analytically calculating lower bounds for these performance
measures. Then in Subsection III.5, an operating strategy which achieves
optimum time performance is developed. Several examples are presented to
illustrate these concepts.

III.2 ATAMM Model Development

In this subsection the ATAMM model to describe concurrent processing of
decomposed algorithms is presented. The model consists of a set of Petri
net marked graphs which incorporate general specifications of communication
and processing associated with each computational event in a data flow
architecture. First, a detailed description of the problem context is
stated. This is followed by the definition of the ATAMM model consisting of
the algorithm marked graph, the node marked graph, and the computational
marked graph. Some familiarity with Petri nets [10] and marked graphs [11] 
is assumed in this presentation. 

The problems of interest are decision-free, computationally complex 
problems as are often found in signal processing and control applications. 

A problem description normally results in the definition of a function given 
by the triple (X,Y,F). The set X represents the set of admissible inputs, 
the set Y represents the set of admissible outputs, and F:X->Y is the rule 
of correspondence which unambiguously assigns exactly one element from Y to 
each element of X. Associated with a computational problem is one or more 
algorithms. An algorithm is an explicit mathematical statement, expressed 
as an ordered set of primitive operations, which explains how to implement 
the rule of correspondence F. In general, a given problem can be decomposed 
by several different primitive operator sets. Also, for a given primitive 
operator set, there are often different orderings of primitive operations 
which can be specified to carry out the problem. Of special interest are 
algorithm decompositions in which two or more primitive operations can be 
performed concurrently. For such decompositions, the potential exists for 
decreasing the computational time required to solve the problem by increasing
the computational resources which implement the primitive operations.

The hardware environment for executing the decomposed algorithms is 
assumed to consist of R identical processors or functional units (FUNs) 
where R has a value in the range of two to twenty. This range of resources
is suggested for practical reasons due to the large-grained aspect of the
algorithm decomposition and the need to maintain small communication times
relative to process times. Each FUN is a processor having local memory for
program storage and temporary input and output data containers. Each FUN can
execute any algorithm primitive operation. The FUNs share a common global
memory (GLM) which may be either centralized or distributed. The coordination
of FUNs in relation to data and control flow is directed by the graph
manager (GRM). The GRM also may be centralized or distributed. Output
created by the completion of a primitive operation is placed into global 
memory only after the output data containers have been emptied. That is, 
outputs must be consumed as inputs to successor primitive operations before 
allowing new data to fill the output locations. Assignment of a functional 
unit to a specific algorithm primitive operation is made by the GRM only 
when all inputs required by the operation are available in global memory and 
a functional unit is available. 

An algorithm marked graph is a marked graph which represents a specific 
algorithm decomposition. Vertices of the algorithm graph are in a one-to- 
one correspondence with each occurrence of a primitive operation. The algo- 
rithm graph contains an edge (i,j) directed from vertex i to vertex j if the 
output of primitive operation i is an input for primitive operation j. Edge 
(i,j) is marked with a token if an output from primitive operator i is
available as an input to primitive operator j. When constructing an algo- 
rithm graph, vertices (primitive operations) are displayed as circles, and 
edges (input-output signals) are displayed as directed line segments con- 
necting appropriate vertices. The presence of a token on an edge is indica- 
ted by a solid dot placed on the edge. Source transitions and sink transi- 
tions for input and output signals are represented as squares. Sources for 
constants are not usually included in the algorithm marked graph; however, 
triangles are used for this purpose when necessary. 




To illustrate the construction of an algorithm marked graph, consider 
the problem of computing the output of a discrete linear system given a 

sequence of inputs to the system. Let the system be described by the state 
equation 

x(k) = Ax(k-1) + Bu(k)

and output equation

y(k) = Cx(k),

where x is a p-vector, u is an m-vector, and y is an r-vector. The primitive
operations are defined as matrix multiplication and vector addition, and the 
natural algorithm decomposition resulting from the state equation descrip- 
tion is selected. The algorithm marked graph for this decomposed algorithm 
is shown in Fig. 1. The initial marking indicates that initial condition 
data are available. 
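
One possible encoding of such an algorithm marked graph is sketched below in
Python. Since Fig. 1 is not reproduced here, the node names, the edge layout,
and the placement of the initial token on the feedback edge carrying x(k-1)
are assumptions chosen to match the state equation; they are illustrative
only.

```python
# Hypothetical AMG for x(k) = A x(k-1) + B u(k), y(k) = C x(k).
# Node names and edge layout are assumptions, not taken from Fig. 1.
amg_nodes = {
    "u": "source",
    "Bu": "operation",   # matrix multiplication B u(k)
    "Ax": "operation",   # matrix multiplication A x(k-1)
    "add": "operation",  # vector addition giving x(k)
    "Cx": "operation",   # matrix multiplication C x(k)
    "y": "sink",
}

# Edge (i, j) maps to 1 if a token (available data) marks the edge, else 0.
amg_edges = {
    ("u", "Bu"): 0,
    ("Bu", "add"): 0,
    ("Ax", "add"): 0,
    ("add", "Ax"): 1,  # initial condition data x(0) are available
    ("add", "Cx"): 0,
    ("Cx", "y"): 0,
}


def enabled_operations(nodes, edges):
    """Return the primitive operations whose incoming edges all hold a token."""
    result = []
    for node, kind in nodes.items():
        if kind != "operation":
            continue
        incoming = [tok for (_, dst), tok in edges.items() if dst == node]
        if incoming and all(tok > 0 for tok in incoming):
            result.append(node)
    return result


ready = enabled_operations(amg_nodes, amg_edges)
```

Under this assumed initial marking only the multiplication by A has all of its
inputs available; the multiplication by B must wait for the source to supply
u(k).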

The algorithm marked graph is a useful tool for representing decomposed 
algorithms and for displaying data flow within an algorithm. However, the 
algorithm graph does not display the procedures that are required to perform
a computing task. In addition, the issues of control, time performance, and
resource management are
not apparent in this graph. These important aspects of concurrent process- 
ing are included in the ATAMM model through the definition of two additional 
graphs. The node marked graph (NMG) is defined to model the execution of a 
primitive operation. The computational marked graph, obtained from the AMG 
and the NMG by a set of construction rules, integrates both the algorithm 
requirements and the computing environment requirements into a comprehensive 
graph model. These additional marked graphs are defined in the following. 

The NMG is a Petri net representation of the performance of a primitive 
operation by a functional unit. Three primary activities, reading of input 
data from global memory, processing of input data to compute output data, 
and writing of output data to global memory, are represented as transitions
(vertices) in the NMG. Data and control flow paths are represented as 
places (edges), and the presence of signals is notated by tokens marking 
appropriate edges. The conditions for firing the process and write tran- 
sitions of the NMG are as defined for a general Petri net, while the read 
transition has one additional condition for firing. In addition to having a 
token present on each incoming signal edge, a functional unit must be avail- 
able for assignment to the primitive operation before the read node can 
fire. Once assigned, the functional unit is used to implement the read,
process, and write operations before being returned to a queue of available 
FUNs. The initial marking for an NMG consists of a single token in the 
"process ready" place. The NMG model is shown in Fig. 2. 

A computational marked graph (CMG) is constructed from the AMG and the
NMG by the following rules.

1. Source and sink nodes in the algorithm marked graph are represented
by source and sink nodes in the CMG.

2. Nodes corresponding to primitive operations in the algorithm marked
graph are represented by NMGs in the CMG.

3. Edges in the algorithm marked graph are represented by edge pairs,
one forward directed for data flow and one backward directed for
control flow, in the CMG. The initial marking for the edge pair
consists of a single token in the forward-directed place if data
are available, or a single token in the backward-directed place if
data are not available.
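
The three construction rules can be sketched in Python as follows. This is a
minimal illustration under stated assumptions: the place names, the tuple
layout, and in particular the names of the two internal NMG places other than
"process ready" are hypothetical, since Fig. 2 is not reproduced here.

```python
def build_cmg(nodes, edges):
    """Build CMG places from an AMG given as node kinds and marked edges.

    Each place is a tuple (name, from_transition, to_transition, tokens).
    A transition is identified by the pair (node, role).
    """
    def out_transition(v):
        # Data leave an operation at its write transition (rule 2).
        return (v, "write") if nodes[v] == "operation" else (v, nodes[v])

    def in_transition(v):
        # Data enter an operation at its read transition (rule 2).
        return (v, "read") if nodes[v] == "operation" else (v, nodes[v])

    places = []
    for v, kind in nodes.items():
        if kind == "operation":
            # Rule 2: one NMG per primitive operation, forming a circuit
            # read -> process -> write -> read with one initial token.
            # The first two place names are assumptions.
            places.append((f"{v}:data read", (v, "read"), (v, "process"), 0))
            places.append((f"{v}:data processed", (v, "process"), (v, "write"), 0))
            places.append((f"{v}:process ready", (v, "write"), (v, "read"), 1))
    for (i, j), tok in edges.items():
        # Rule 3: a forward data place and a backward control place,
        # exactly one of which holds the initial token.
        places.append((f"{i}->{j}:data", out_transition(i), in_transition(j), tok))
        places.append((f"{i}->{j}:control", in_transition(j), out_transition(i), 1 - tok))
    return places


# A small hypothetical AMG: source -> f -> g -> sink, no data available yet.
nodes = {"in": "source", "f": "operation", "g": "operation", "out": "sink"}
edges = {("in", "f"): 0, ("f", "g"): 0, ("g", "out"): 0}
cmg = build_cmg(nodes, edges)
```

For this example the CMG has three places per NMG and two per AMG edge, and
every edge pair and every NMG circuit carries exactly one initial token.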

The play of the CMG proceeds according to the following graph rules. 

1. A node is enabled when all incoming edges are marked with a token.
An enabled node fires by encumbering one token from each incoming
edge, delaying for some specified transition time, and then depositing
one token on each outgoing edge.

2. A source node and a sink node fire when enabled without regard for 
the availability of a FUN. 

3. A primitive operation is initiated when the read node of an NMG is 
enabled and a FUN is available for assignment to the NMG. A FUN 
remains assigned to an NMG until completion of the firing of the 
write node of the NMG. 
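
The play of the CMG under these three rules can be illustrated with a small
event-driven simulation. The Python sketch below is not the simulation tool
described in this report; the function `play`, the place and transition
naming, and the single-operation example graph are assumptions made for
illustration. Every transition is assumed to have at least one incoming
place, as is the case in a CMG.

```python
import heapq


def play(places, times, num_funs, t_end):
    """Play a CMG until t_end and return a log of (time, transition) firings.

    places: {name: (from_transition, to_transition, initial_tokens)}
    times:  {transition: transition time}; a transition is (node, role).
    """
    tokens = {name: tok for name, (_, _, tok) in places.items()}
    transitions = sorted(times)
    inputs = {t: [p for p, (_, dst, _) in places.items() if dst == t]
              for t in transitions}
    outputs = {t: [p for p, (src, _, _) in places.items() if src == t]
               for t in transitions}
    free_funs, pending, log, now = num_funs, [], [], 0
    while True:
        started = True
        while started:  # start every transition that can fire at time `now`
            started = False
            for t in transitions:
                if not all(tokens[p] > 0 for p in inputs[t]):
                    continue  # rule 1: every incoming edge needs a token
                needs_fun = t[1] == "read"  # rules 2 and 3: only reads need a FUN
                if needs_fun and free_funs == 0:
                    continue
                for p in inputs[t]:
                    tokens[p] -= 1  # encumber one token per incoming edge
                if needs_fun:
                    free_funs -= 1
                heapq.heappush(pending, (now + times[t], t))
                started = True
        if not pending or pending[0][0] > t_end:
            return log
        now, t = heapq.heappop(pending)
        for p in outputs[t]:
            tokens[p] += 1  # deposit one token per outgoing edge
        if t[1] == "write":
            free_funs += 1  # the FUN is released after the write completes
        log.append((now, t))


# Hypothetical CMG for one primitive operation f between a source and a sink.
places = {
    "in->f:data": (("in", "source"), ("f", "read"), 0),
    "in->f:control": (("f", "read"), ("in", "source"), 1),
    "f:data read": (("f", "read"), ("f", "process"), 0),
    "f:data processed": (("f", "process"), ("f", "write"), 0),
    "f:process ready": (("f", "write"), ("f", "read"), 1),
    "f->out:data": (("f", "write"), ("out", "sink"), 0),
    "f->out:control": (("out", "sink"), ("f", "write"), 1),
}
times = {("in", "source"): 0, ("f", "read"): 1, ("f", "process"): 3,
         ("f", "write"): 1, ("out", "sink"): 0}
log = play(places, times, num_funs=1, t_end=16)
sink_times = [time for time, name in log if name == ("out", "sink")]
```

With a single FUN and read, process, and write times of 1, 3, and 1, the sink
fires every 5 time units, so the time between outputs in this example equals
the sum of the three NMG transition times.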

In order to illustrate the construction of a computational marked 
graph, the CMG corresponding to the algorithm marked graph of Fig. 1 is 
shown in Fig. 3. The computational marked graph is useful because it clearly
displays the data and control flow which must occur in any hardware
implementation of the model process, and because it provides a hardware
independent context in which to evaluate process performance.

The complete ATAMM model consists of the algorithm marked graph, the 
node marked graph, and the computational marked graph. A pictorial display 
of this model is shown in Fig. 4. In the next subsection, important
operating characteristics of the ATAMM model are investigated.

III.3 Model Characteristics

In the previous subsection, a marked graph model consisting of the AMG, 
the NMG, and the CMG is defined as a means to describe concurrent processing 
of decomposed algorithms. In this subsection the ATAMM model is studied 
analytically to determine important graph operating characteristics. First, 
a state description which expresses the next graph marking as a function of 
the present marking and a vector indicating which transition is to be fired
is developed. Then, the marked graph properties of reachability, liveness,
and safeness are considered for the CMG. Two excellent papers by Murata
[11], [12] on properties of marked graphs are the source for much of the
material presented in the subsection.

Let G be a marked graph consisting of m places and n transitions. The
m-vector $M_k$ denotes the marking vector for G resulting from the firing of
some sequence of k transitions. The following two definitions are necessary
to develop the state description of the CMG.

Definition 1: Complete Incidence Matrix. The complete incidence matrix for
a marked graph G is the (n x m) matrix $A = [a_{ij}]$ having rows
corresponding to transitions, columns corresponding to places, and where

\[
a_{ij} =
\begin{cases}
+1\ (-1) & \text{if place } j \text{ is incident at transition } i \text{ and directed out of (into) the transition,} \\
0 & \text{if place } j \text{ is not incident at transition } i.
\end{cases}
\]

Definition 2: Elementary Firing Vector. An elementary firing vector $u_k$ is
an n-vector having all zero entries except for the ith component which is 1,
denoting that transition i is the kth transition to fire in some transition
firing sequence.

To gain insight into the state equation description, it is helpful to
consider the firing of transition k. If $a_{ki} = -1\ (+1)$, place i is an
input (output) place to transition k. Therefore, transition k is enabled if
$M(i) \geq 1$ for each input place i. When transition k fires, one token is
removed from each input place and one token is added to each output place.
These observations lead to the following next state description for a marked
graph.

Property 1: Next State Description. For a marked graph G with present
marking vector $M_{k-1}$ and elementary firing vector $u_k$, the next marking
vector is given by

\[
M_k = M_{k-1} + A^T u_k .
\]
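
As a worked illustration of the next state description, in which the next
marking equals the present marking plus the transposed incidence matrix times
the elementary firing vector, the Python sketch below applies it to a single
directed circuit of three transitions and three places. The example graph and
the function names are assumptions chosen for illustration; the circuit
mirrors the read, process, write cycle of an NMG with its initial token in
the place leading into the read transition.

```python
def is_enabled(M, A, i):
    """Transition i is enabled if each of its input places holds a token."""
    return all(M[j] >= 1 for j in range(len(M)) if A[i][j] == -1)


def next_marking(M_prev, A, u):
    """Return M_prev + A^T u for an n x m incidence matrix A and n-vector u."""
    n, m = len(A), len(A[0])
    return [M_prev[j] + sum(A[i][j] * u[i] for i in range(n)) for j in range(m)]


# Rows: transitions read, process, write. Columns: places p0 (read -> process),
# p1 (process -> write), p2 (write -> read). Entry +1 marks a place directed
# out of the transition, -1 a place directed into it.
A = [
    [+1, 0, -1],  # read consumes from p2 and deposits on p0
    [-1, +1, 0],  # process consumes from p0 and deposits on p1
    [0, -1, +1],  # write consumes from p1 and deposits on p2
]
M0 = [0, 0, 1]                       # single initial token on p2
M1 = next_marking(M0, A, [1, 0, 0])  # fire read
M2 = next_marking(M1, A, [0, 1, 0])  # fire process
M3 = next_marking(M2, A, [0, 0, 1])  # fire write
```

After the three firings the marking returns to its initial value, and at each
step exactly one transition of the circuit is enabled.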

The next state description can be used to express the graph marking 
resulting from the application of sequences of elementary firing vectors. 
This is done in the next definition and property. 

Definition 3: Firing Count Vector. Let $(u_1, u_2, \ldots, u_d)$ be a
sequence of elementary firing vectors taking a marked graph G from an initial
marking $M_0$ to a destination marking $M_d$. The firing count vector $x_d$
for this firing sequence is defined by

\[
x_d = \sum_{k=1}^{d} u_k .
\]

Property 2: State Equation Description. For a marked graph G with initial
marking vector $M_0$, the marking vector $M_d$ resulting from the application
of elementary firing vector sequence $(u_1, u_2, \ldots, u_d)$ is given by

\[
M_d = M_0 + A^T x_d .
\]

Using the state description of a marked graph as a basis, the property 
of reachability is investigated. Necessary and sufficient conditions for a 
CMG marking vector to be reachable from an initial marking are established, 
and it is shown that the number of tokens contained in any directed circuit
of the CMG is invariant under transition firings. 

Definition 4: Reachability. A marking $M_d$ is reachable from an initial
marking $M_0$ if there exists a sequence of elementary firing vectors that
transforms $M_0$ to $M_d$.

The following definition is required to state the reachability condi- 
tions for a CMG. 

Definition 5: Fundamental Circuit Matrix. Let T be a tree of a connected
marked graph G. The set of (m-n+1) circuits, each uniquely formed by
appending one cotree edge to the tree, is called the set of fundamental
circuits of G for tree T [13]. The fundamental circuit matrix for G for
tree T is the (m-n+1) x (m) matrix $B_f = [b_{ij}]$ having rows corresponding
to fundamental circuits, columns corresponding to places, and where

\[
b_{ij} =
\begin{cases}
+1\ (-1) & \text{if place } j \text{ is contained in f-circuit } i \text{ and the place and circuit directions agree (disagree),} \\
0 & \text{if place } j \text{ is not contained in f-circuit } i.
\end{cases}
\]

Property 3: Reachability in the CMG. In a computational marked graph G, a
marking $M_d$ is reachable from an initial marking $M_0$ if and only if
$B_f M_d = B_f M_0$, where $B_f$ is a fundamental circuit matrix for G.

Proof. It is shown in [11] (Theorem 3) that the property is true for marked
graphs containing no token-free directed circuits. By the construction
rules for the CMG, directed circuits occur in exactly four ways. First,
each NMG consists of a directed circuit which contains an initial marking
token in the "process ready" place. Second, a directed circuit is formed
each time an NMG is linked to another NMG. Since one of the two linking
places contains an initial marking token and both places are contained in
the circuit, this circuit is never token free. Third, directed circuits 
exist in the CMG corresponding to interconnected feedforward paths in the 
algorithm marked graph. Each such circuit contains one or more backward- 
directed control edge containing one initial marking token. Fourth, 
directed circuits exist in the CMG corresponding to directed circuits in the 
algorithm marked graph. Each such circuit contains exactly one forward- 
directed edge containing one initial marking token representing initial 
condition data. Therefore, the CMG contains no token-free directed circuits 
and the property follows. 

As a direct consequence of the reachability property of the CMG, it can 
be shown that the number of tokens in any directed circuit is constant. 

This characteristic is stated as Property 4. 

Property 4: Token Count Invariance. In a CMG, the number of tokens
contained in a directed circuit is invariant under transition firing.

Proof. Consider a directed circuit C of a CMG. The entries in the row of a
circuit matrix B corresponding to C are ±1 in columns representing edges in
C and are 0 otherwise. If M is a marking vector, the component of BM
corresponding to C is equal to the number of tokens in directed circuit C
under marking M. Therefore, if M_d is any marking reachable from an initial
marking M_0, it follows from Property 3 that B M_d = B M_0. That is, the
number of tokens in directed circuit C under initial marking M_0 is equal to
the number of tokens under any marking M_d reachable from M_0. This completes
the proof.
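Property 4 can be observed directly by playing a small graph. The Python
sketch below is illustrative only: the four-place graph, its two directed
circuits, and the helper names fire and circuit_tokens are hypothetical and
are not taken from the report's figures.

```python
# Each place is (source transition, destination transition).
PLACES = [(0, 1), (1, 0), (1, 2), (2, 1)]
# Indices of the places lying on each directed circuit (hypothetical graph).
CIRCUITS = {"C1": [0, 1], "C2": [2, 3]}
M0 = [0, 1, 0, 1]  # one initial token on each circuit

def fire(marking, t):
    """Fire transition t: take one token from each input place and
    deposit one token in each output place."""
    if any(marking[p] == 0 for p, (_, dst) in enumerate(PLACES) if dst == t):
        raise ValueError(f"transition {t} is not enabled")
    new = list(marking)
    for p, (src, dst) in enumerate(PLACES):
        if dst == t:
            new[p] -= 1
        if src == t:
            new[p] += 1
    return new

def circuit_tokens(marking):
    """Token count of every directed circuit under the given marking."""
    return {name: sum(marking[p] for p in ps) for name, ps in CIRCUITS.items()}

marking = M0
for t in [0, 1, 2, 0, 1]:
    marking = fire(marking, t)
    # The count on each circuit never changes, as Property 4 states.
    assert circuit_tokens(marking) == circuit_tokens(M0)
print(marking, circuit_tokens(marking))
```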

Next, liveness and a closely related property called consistency are
considered. It is shown that the CMG is live and consistent.

Definition 6: Liveness. A marked graph G is said to be live for marking M
if, for all markings reachable from M, it is possible to fire any
transition of G by progressing through some transition firing sequence.

Property 5: Liveness in the CMG. The computational marked graph is live
for all appropriate initial marking vectors.

Proof. It is shown in [12] (Property 2) that a marked graph G is live for a
marking M if and only if G contains no token-free directed circuits in
marking M. As stated in the proof of Property 3, for all appropriate initial
markings M_0, the CMG contains no token-free directed circuits. Therefore,
the property follows.

Definition 7: Consistency. A marked graph G is said to be consistent if
there exists a marking M and a transition firing sequence S from M back to M
such that every transition occurs at least once in S.

Property 6: Consistency in the CMG. A connected computational marked graph G
is consistent. In addition, each transition of G occurs an equal number of
times in a firing sequence from a marking M back to M.

Proof. From Property 2, if a CMG is consistent, then there exists a marking
M and a firing count vector x_d > 0 such that A^T x_d = 0. The converse
is also true. The incidence matrix for a marked graph G is an (n x m)
matrix A. If G is connected, then it is known [13] that the rank of A is
n-1, and thus the null space of A^T has dimension one. It is observed that
each row of A^T has one (1), one (-1), and all remaining terms are (0).
Therefore, if c_j denotes the j-th column of A^T, it follows that

    \sum_{j=1}^{n} c_j = 0

Thus, there exists a vector x_d = [k, k, ..., k]^T, k > 0, which uniquely
satisfies A^T x_d = 0. This completes the proof.

The final graph property considered in this section is safeness. This 
property is first defined, and then it is shown that CMG is safe. 

Definition 8: Safeness. A marked graph G is said to be safe for marking M
if, for all markings reachable from M, no place contains more than one
token.

Property 7: Safeness in the CMG. The computational marked graph is safe
for all appropriate initial marking vectors.

Proof. By Property 4, the token count for each directed circuit of the CMG 
is invariant under transition firing. Therefore it is sufficient to show 
that each edge of the CMG belongs to at least one directed circuit contain- 
ing a single token. By the construction rules for the CMG, all CMG edges 
can be classified into two groups, NMG edges and linking edges. NMG edges 
occur in groups of three and always form a directed circuit containing one 
token. Linking edges occur in pairs, one forward directed and one backward 
directed, and also form a directed circuit with the forward directed edges 
of the NMG. One of the linking edges, but not both, always contains one 
token while the forward directed edges of the NMG contain no tokens. There- 
fore, each edge of the CMG is contained in a directed circuit with one 
token, and the property follows. 

III.4 Performance Analysis

The importance of the ATAMM model is that it establishes a context in 
which to investigate the performance of decomposed algorithms in multipro- 
cessor data flow architectures. In this subsection, performance measures 
indicating computing speed and throughput capacity are defined. Bounds for
these quantities are calculated analytically from the algorithm marked graph
and the computational marked graph. This information is essential for
efficiently matching algorithm decompositions with architecture
implementations.

The work presented in this subsection is an interesting application and
extension of recent investigations of the performance of Petri Nets [14],
[15] and marked graphs [16].

It is assumed that a decomposed algorithm is implemented in a
multiprocessor architecture containing R computing resources or functional
units. Each functional unit is capable of performing any of the primitive
operations whose sequence defines the decomposition. A computational task
consists of completing the algorithm for one frame of data and is initiated
when an input data token from the source node is encumbered. Task output
occurs when a corresponding output data token is deposited at the output
sink node. A task is completed when all computing associated with the task
is completed. It should be noted that task output and task completion do
not always coincide. In many iterative signal processing algorithms,
computing to generate initial conditions for the next iteration often
continues after an output has been calculated. Task completion is usually
indicated in the AMG or CMG by the return of the graph to some steady-state
initial marking. To facilitate measurement of throughput capacity, it is
assumed that tasks are repeated periodically with new input data sets. New
data sets are available continuously as input tokens from the input source
node. Included in this problem class are iterative algorithms where the
present task requires as inputs data from previous task calculations.

Concurrency in this problem setting occurs in two ways. First, different
functional units may perform simultaneously several primitive operations
belonging to a single task. This type of concurrency is referred to as
vertical concurrency. Vertical concurrency has a direct effect on task
computing speed. It is limited by the number of primitive operations that
can be performed simultaneously in a given algorithm decomposition, and by
the number of functional units available to perform the primitive
operations. Second, different functional units may perform simultaneously
primitive operations belonging to different tasks sequentially input to the
computing system. Called horizontal concurrency, this type of concurrency
has a direct effect on throughput capacity. It is limited by the capacity
of the graph to accommodate additional task inputs, and by the number of
functional units available to implement the tasks. In the following it is
shown that the process of algorithm decomposition imposes bounds on the
amount of vertical concurrency and horizontal concurrency possible in a
given problem. If sufficient computing resources are available, operation at
these bounds can be achieved. If the number of computing resources is
limited, the bounds cannot be reached simultaneously and trade-offs between
the amount of vertical concurrency and horizontal concurrency are possible.

Three performance measures for concurrent processing are defined. The 
first two parameters, TBIO and TT, are indicators of computing speed and 
reflect the degree of vertical concurrency. The third parameter, TBO, is a 
measure of throughput capacity and thus reflects the degree of horizontal 
and vertical concurrency. 

Definition 9: TBIO. The performance measure TBIO is the computing time
which elapses between a task input and the corresponding task output.

Definition 10: TT. The performance measure TT is the computing time which
elapses between a task input and the completion of all computation
associated with that task.

Definition 11: TBO. The performance measure TBO is the computing time
which elapses between successive task outputs when the graph is operating
periodically in steady-state.

The remainder of this section is devoted to developing lower bounds for 
these performance measures. 

Let G denote an algorithm marked graph representing a decomposed
algorithm. The lower bound for TBIO is the shortest time required for a
data token from the data input source to propagate through the graph to the
data output sink. Similarly, the lower bound for TT is the shortest time
required to complete all computing activity initiated by the injection of a
data input source. These shortest times are the actual performance times
when only a single task is active in the graph during any time interval (no
horizontal concurrency), and as many computing resources as are required are
available (maximum vertical concurrency). Under these operating conditions,
lower bounds for TBIO and TT are calculated by identifying certain longest
paths in a graph obtained from the algorithm marked graph. This new graph,
called the modified algorithm graph G_M, is defined and then used to
determine lower bounds for TBIO and TT.

Definition 12: Modified Algorithm Graph. Let p_i be a place of G, directed
from transition t_r to transition t_s, which contains a token of the initial
marking. The modified algorithm graph G_M is obtained from the graph G by
the following construction rules.

1. Place p_i is deleted from G.

2. A new place p_i1, directed from the data input source to transition
t_s, is added to G.

3. A new output sink s_i, different from all other output sinks, and a
new place p_i2, directed from transition t_r to s_i, are added to G.

4. The above rules are repeated for each place of G containing a token
of the initial marking.
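The construction rules can be sketched in a few lines of Python. This is a
minimal illustration, not the authors' procedure: the tuple layout for a
place, the node name "src", the sink naming scheme, and the function name
modified_algorithm_graph are all assumptions.

```python
def modified_algorithm_graph(places, source="src"):
    """Apply the construction rules of Definition 12.

    places: list of (name, from_node, to_node, has_initial_token).
    Returns the places of the modified graph as (name, from_node, to_node).
    """
    result = []
    for name, t_r, t_s, has_token in places:
        if not has_token:
            result.append((name, t_r, t_s))      # place is kept unchanged
            continue
        # Rule 1: the token-bearing place is dropped (never appended).
        # Rule 2: new place from the data input source to t_s.
        result.append((f"{name}-1", source, t_s))
        # Rule 3: new sink and new place from t_r to that sink.
        result.append((f"{name}-2", t_r, f"sink_{name}"))
    return result                                 # Rule 4: loop covers all

# Hypothetical graph with one feedback place "5" holding an initial token.
example = [
    ("1", "src", "t1", False),
    ("2", "t1", "t2", False),
    ("5", "t4", "t2", True),
    ("6", "t2", "t4", False),
]
for place in modified_algorithm_graph(example):
    print(place)
```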

Lower bounds for TBIO and TT are presented in Theorem 1 and Theorem 2 
respectively. 

Theorem 1: Lower Bound for TBIO. Let P_i be the i-th directed path in G_M
from the data input source to the data output sink, and let T(P_i) denote the
sum of transition times for transitions contained in P_i. Then,

    TBIO_{LB} = \max_{i} \{ T(P_i) \},

where the maximum is taken over all paths P_i in graph G_M.

Proof. Without loss of generality, let t_f be the last transition in all
paths P_i directed from the data input source to the data output sink.
Transition t_f is enabled when each input place for t_f contains a token.
Since by assumption a computing resource is available, t_f fires as soon as
it becomes enabled. Let p_f be the last input place for t_f to acquire a
token, and let t_s be the input transition for place p_f. Continuing this
labeling procedure results in a backward path construction process. This
process is repeated, first at t_s, and then at each succeeding transition
until the data input source is reached, identifying a path P_j. By the
construction process for the path, it is clear that T(P_j) = Max{T(P_i)},
where the maximum is over all paths P_i in G_M. It is also clear that
TBIO_LB can be no shorter than T(P_j) so that TBIO_LB ≥ T(P_j). Since a
computing resource is available when each transition in P_j is enabled, the
time between input and corresponding output can be no longer than T(P_j) so
that TBIO_LB ≤ T(P_j). Therefore, TBIO_LB = T(P_j) = Max{T(P_i)}, where the
maximum is over all paths P_i in G_M. This completes the proof.

Theorem 2: Lower Bound for TT. Let P_i be the i-th directed path in G_M from
the data input source to any output sink, and let T(P_i) denote the sum of
transition times of transitions contained in P_i. Then,

    TT_{LB} = \max_{i} \{ T(P_i) \},

where the maximum is taken over all paths P_i in graph G_M.

Proof. By the construction rules for graph G_M, a task is initiated when
input data tokens are input from the data input source, and is completed
when all output sinks have accepted tokens. Therefore, TT is the time which
elapses from injection of input tokens to the arrival of a token at the last
fired output sink. Let T(P_t) = Max{T(P_i)}, P_i in G_M, be the longest path
time of paths from the data input source s_I to any output sink, say s_t.
Since a token must reach sink s_t before a task is completed, it follows that
TT_LB ≥ T(P_t). Since a resource is available for each transition to fire
when enabled, and since P_t is the longest path in G_M, it also follows that
TT_LB ≤ T(P_t). Therefore, TT_LB = T(P_t) = Max{T(P_i)}, where the maximum is
over all paths P_i in G_M. This completes the proof.

To illustrate the application of Theorem 1 and Theorem 2, TBIO_LB and
TT_LB are computed for the algorithm graph shown in Fig. 1. For this
example, the following transition times are assumed: T(1) = 4, T(2) = 1,
T(3) = 5, and T(4) = 6. The modified algorithm graph corresponding to Fig. 1
is shown in Fig. 5. The modified algorithm graph contains two paths directed
from the data input source s_I to the data output sink s_O. Path P_1 consists
of edge set {1, 2, 3, 4} with T(P_1) = 10, and path P_2 consists of edge set
{5-1, 3, 4} with T(P_2) = 6. Therefore, since T(P_1) > T(P_2), path P_1
determines the lower bound for TBIO and TBIO_LB = 10. The modified algorithm
graph contains two additional directed paths from the data input source s_I
to the added output sink. Path P_3 consists of edge set {1, 2, 6, 5-2} with
T(P_3) = 11, and path P_4 consists of edge set {5-1, 6, 5-2} with T(P_4) = 7.
Since T(P_3) > T(P_1) > T(P_4) > T(P_2), path P_3 determines the lower bound
for TT and TT_LB = 11.
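The example can be checked numerically with a short Python sketch of the
longest-path calculation. The graph topology below is not copied from Fig. 5,
which is not reproduced here; it is inferred from the edge sets and path
times listed above. The node names (src, t1, ..., sink_out, sink_new) and
the function name longest_path_time are hypothetical.

```python
from functools import lru_cache

# Transition times assumed in the example.
TIMES = {"t1": 4, "t2": 1, "t3": 5, "t4": 6}
# Inferred modified algorithm graph; comments give the edge labels.
EDGES = {
    "src": ["t1", "t2"],   # edges 1 and 5-1
    "t1": ["t2"],          # edge 2
    "t2": ["t3", "t4"],    # edges 3 and 6
    "t3": ["sink_out"],    # edge 4
    "t4": ["sink_new"],    # edge 5-2
}

def longest_path_time(start, goals):
    """Largest sum of transition times over directed paths from start to
    any node in goals. The graph must be acyclic."""
    goals = frozenset(goals)

    @lru_cache(maxsize=None)
    def best(node):
        if node in goals:
            return 0
        reachable = [best(nxt) for nxt in EDGES.get(node, [])]
        reachable = [r for r in reachable if r is not None]
        if not reachable:
            return None            # no goal can be reached from this node
        return TIMES.get(node, 0) + max(reachable)

    return best(start)

tbio_lb = longest_path_time("src", ["sink_out"])             # Theorem 1
tt_lb = longest_path_time("src", ["sink_out", "sink_new"])   # Theorem 2
print(tbio_lb, tt_lb)  # 10 11
```

Under the inferred topology the sketch reproduces the values stated in the
text, TBIO_LB = 10 and TT_LB = 11.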

Next a lower bound for the performance measure TBO is presented. Let G
be a computational marked graph representing a decomposed algorithm. It is
assumed that operating conditions for G are set to maximize horizontal
concurrency. That is, data tokens are continuously available at the data
input source, and as many computing resources as needed can be called to
perform primitive operations. With these conditions, the graph plays
periodically in steady-state, and TBO_LB is the shortest time possible
between successive outputs.

Theorem 3: Lower Bound for TBO. Let G be a computational marked graph and
let C_i be the i-th directed circuit in G. The notation T(C_i) denotes the
sum of transition times of transitions contained in C_i, and M(C_i) denotes
the number of tokens contained in C_i. Then,

    TBO_{LB} = \max_{i} \{ T(C_i) / M(C_i) \},

where the maximum is taken over all directed circuits in G.

Proof. Without loss of generality, let t_o be the output transition in G so
that an output is produced each time t_o completes firing. Then TBO_LB is
the minimum firing period of transition t_o. By Property 6, G is consistent
so that all transitions of G fire periodically with minimum period TBO_LB.
It is shown in [12] (pp. 58-60) that the minimum firing period of each
transition of a marked graph is given by Max{T(C_i)/M(C_i)}, where the
maximum is taken over all directed circuits in G. Therefore, the theorem
follows.

The computational marked graph shown in Fig. 3 is used to illustrate
Theorem 3. This CMG contains many directed circuits. However, the directed
circuit which contains all NMG nodes of transitions 2 and 4 contains only
one token and maximizes the ratio T(C_i)/M(C_i). Therefore, the shortest
time possible between successive outputs in this graph, TBO_LB, is the total
transition time of this circuit divided by its single token. In the next
subsection, a strategy for achieving optimum time performance is
investigated.
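Theorem 3 can be evaluated on small graphs by enumerating directed circuits,
as in the Python sketch below. The three-transition graph, its times and
token counts, and the function name min_period are hypothetical and are not
taken from Fig. 3. Simple enumeration grows exponentially with graph size, so
this is a sketch for small examples only.

```python
def min_period(times, places):
    """Return max over directed circuits C of T(C) / M(C).

    times:  {transition: firing time}
    places: list of (from_transition, to_transition, tokens)
    """
    succ = {}
    for src, dst, tokens in places:
        succ.setdefault(src, []).append((dst, tokens))
    best = 0.0

    def walk(start, node, visited, t_sum, m_sum):
        nonlocal best
        for nxt, tokens in succ.get(node, []):
            if nxt == start:                       # circuit closed
                m = m_sum + tokens
                if m == 0:
                    raise ValueError("token-free directed circuit")
                best = max(best, t_sum / m)
            elif nxt not in visited and nxt > start:
                # Only visit nodes ordered after start, so each circuit is
                # found once, from its smallest node.
                walk(start, nxt, visited | {nxt},
                     t_sum + times[nxt], m_sum + tokens)

    for start in sorted(times):
        walk(start, start, {start}, times[start], 0)
    return best

# Hypothetical graph: circuit A-B-C carries 6 time units and 2 tokens,
# circuit A-B carries 3 time units and 2 tokens.
TIMES = {"A": 2, "B": 1, "C": 3}
PLACES = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2), ("B", "A", 2)]
print(min_period(TIMES, PLACES))
```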

III.5 Strategy for Optimum Time Performance

A model describing decomposed algorithms for implementation in a
distributed data flow architecture is described in Subsections III.2 and
III.3, and performance measures are defined in Subsection III.4. An important
problem remaining is to develop an operating strategy for the ATAMM model
which achieves optimum time performance with a minimum number of computing
resources. Unfortunately, this problem is equivalent to a class of
scheduling problems which is known to be NP-complete. Thus, no efficient
algorithm for obtaining an optimum solution is known, and in the worst case
one can do little better than enumerating all possible solutions and then
choosing the best one. However, an important suboptimal operating strategy
which achieves optimum time performance, but possibly requires more than the
minimum number of computing resources, has been developed. This strategy is
presented and illustrated by example in this subsection.

When presented with continuously available input data sets, the natural
behavior of a data flow architecture results in operation where new data
sets are accepted as rapidly as the available resources permit. That is,
the architecture naturally operates at high levels of horizontal concurrency
with the possible loss of capability for achieving high levels of vertical
concurrency. This results in performance characterized by high throughput
rates, TBO = TBO_LB, but relatively poor task computing speed so that
TBIO > TBIO_LB and TT > TT_LB. In many signal processing and control
applications, it is important to achieve both high throughput rate and high
task computing speeds. Often, designers are willing to provide extra
hardware to realize optimum time performance. The suboptimal operating
strategy presented in this section results in performance having the
following characteristics.

1. When R ≥ R_Max, operation achieves TBIO_LB, TT_LB, and TBO_LB. R_Max is
computed in implementing the strategy, and represents the minimum
number of resources which ensures maximum horizontal concurrency
and maximum vertical concurrency under this strategy.

2. When R_Max > R ≥ R_Min, operation achieves TBIO_LB and TT_LB, but
TBO > TBO_LB. The strategy preserves task computing speed or vertical
concurrency at the expense of throughput rate or horizontal
concurrency. R_Min is also computed in implementing the strategy, and
represents the minimum number of resources needed to maintain
vertical concurrency with limited horizontal concurrency.

3. When R_Min > R ≥ 1, operation continues but performance degrades so
that TBIO ≥ TBIO_LB, TT ≥ TT_LB, and TBO > TBO_LB.

Implementation of the operating strategy is illustrated in Fig. 6. All 
that is required is to limit the rate at which new input data are presented 
to the CMG. This is accomplished by adding a control transition connected 
in a directed circuit with the data input source. The control transition 
imposes a minimum delay of D time units between inputs. Delay D is chosen 
according to the following rule:

    D = \begin{cases}
        TBO_{LB}     & R \geq R_{Max} \\
        TBO_{Min}(R) & R_{Max} > R \geq R_{Min} \\
        TCE          & R_{Min} > R \geq 1
    \end{cases}

TCE denotes the total computing effort required to complete the task, and
TBO_Min, R_Max, and R_Min are computed as part of the strategy design
procedure.

The operating strategy design process consists of five steps. These
steps are presented and explained in the remainder of this subsection. An
operating strategy is developed for the example algorithm graph shown in
Fig. 7 to illustrate each step as it is presented.

Step 1. Choose a convenient transition firing rule. A rule to determine
when an enabled transition in the CMG fires must be specified. A natural
rule is to specify that enabled transitions fire when a computing resource
is available. If conflict exists, such as when there are more enabled
transitions than computing resources, then firing occurs according to a
priority ordering of the transitions. For the example algorithm graph, the 
highest to lowest priority ordering of the transitions is chosen as <5,4,3,- 

Step 2. Determine TBO_LB. The performance bound TBO_LB is found from the
computational marked graph by application of Theorem 3. The CMG
corresponding to the example algorithm graph is shown in Fig. 8. The
directed circuit identified in this figure contains 6 transition time units
and 2 tokens, and maximizes the ratio T(C_i)/M(C_i) for all directed
circuits. Therefore, TBO_LB = 3.

Step 3. Determine the resource utilization envelope of a single task
required for maximum vertical concurrency at steady-state with TBO = TBO_LB.
The purpose of this step is to determine the number of computing resources
required as a function of time to achieve maximum vertical concurrency in a
single task. The envelope is determined by playing the graph assuming
unlimited resources and an input period of TBO_LB until steady-state
operation is reached. The resource utilization envelope is obtained by
counting the number of computing resources used for a single task during
each time interval. The play of the example algorithm graph under these
conditions is shown in Fig. 9, and the resulting resource utilization
envelope is shown in Fig. 10.

Step 4. Stabilize the resource utilization envelope by adding control
places as necessary. If the time between inputs to the CMG is increased
above TBO_LB, the resource utilization envelope may change from that observed
in Step 3. Since knowledge of the envelope is required to calculate the
number of required resources, additional places are appended to the AMG and
the CMG to freeze the shape of the envelope. For example, the play of the
example algorithm graph of Fig. 8 with an injection time of 4 is shown in
Fig. 11. At this slower injection rate, transition 6 fires one time unit
earlier. To prevent time movement of transition 6, a control place directed
from transition 2 to transition 6 is added. This place prevents the firing
of transition 6 until transition 2 has completed firing. Thus the resource
utilization envelope computed for an input period of TBO_LB is the envelope
for all input periods TBO ≥ TBO_LB.

Step 5. Compute R_Max, R_Min, and TBO_Min(R) using the resource utilization
envelope. R_Max is determined by overlaying resource utilization
requirements, each delayed by TBO_LB with respect to the previous one, as
shown in Fig. 12 for the example. R_Max is equal to the largest resource
requirement during any time interval within the steady state operating
period. R_Min is the minimum number of resources necessary to ensure maximum
vertical concurrency with no horizontal concurrency. This number is equal to
the maximum resource requirement indicated in the resource utilization
envelope for a single task. For the example problem, R_Max = 5 and
R_Min = 3. The value of TBO_Min for each resource number R between R_Min and
R_Max inclusive, is determined by increasing the delay between overlapping
resource utilization envelopes until the maximum resource requirement is R.
TBO_Min is the smallest input delay to produce this resource requirement.
For the example, the calculations of TBO_Min for R = 4 and R = 3 are
illustrated in Fig. 13 and Fig. 14 respectively. The results of these
calculations are TBO_Min(4) = 3.5 and TBO_Min(3) = 4.
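The overlay calculation of Step 5 can be sketched in Python as follows. The
envelope below is hypothetical: Fig. 10 is not reproduced here, so the
samples were simply chosen to be consistent with the totals stated for the
example (a task length of 7 time units and a total computing effort of 10).
Time is sampled in half units so that a period of 3.5 can be represented,
and the function names are assumptions made for this sketch.

```python
STEP = 0.5  # time units per envelope sample (assumed resolution)

# Hypothetical single-task envelope: resources in use in each half unit.
ENVELOPE = [1, 1, 2, 2, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1]

def peak_demand(envelope, period):
    """Largest total resource demand in steady state when a new task is
    started every `period` samples (overlay of shifted envelopes)."""
    demand = [0] * period
    for t, used in enumerate(envelope):
        demand[t % period] += used
    return max(demand)

def tbo_min(envelope, resources, tbo_lb):
    """Smallest input period (in samples, not below tbo_lb) whose overlay
    never needs more than `resources` functional units."""
    if resources < max(envelope):
        raise ValueError("fewer resources than a single task requires")
    period = tbo_lb
    while peak_demand(envelope, period) > resources:
        period += 1
    return period

TBO_LB = 6                                  # 3 time units, in samples
r_max = peak_demand(ENVELOPE, TBO_LB)       # overlay at period TBO_LB
r_min = max(ENVELOPE)                       # single task, no overlap
print(r_max, r_min)
for r in range(r_min, r_max + 1):
    print(r, tbo_min(ENVELOPE, r, TBO_LB) * STEP)
```

With this assumed envelope the sketch gives R_Max = 5 and R_Min = 3, and
input periods of 3.5 for R = 4 and 4 for R = 3. Agreement with the example
values follows from how the envelope was chosen and should not be read as a
reconstruction of Fig. 10.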

The performance of the example algorithm graph is summarized in Fig.
15. Optimum time performance of TBIO_LB = TT_LB = 7 and TBO_LB = 3 is
achieved for R ≥ R_Max = 5. At R = 4, TBIO and TT remain at the optimum
values and TBO_Min increases to 3.5. At R = 3, TBIO and TT again remain at
the optimum values and TBO_Min increases to 4. For values of R below R_Min,
time performance generally degrades. However, in this example TBIO and TT
remain at 7 for R = 2, while TBO_Min increases to 6. Finally, at R = 1,
performance degrades to TBIO = TT = TBO = TCE = 10. Another perspective of
algorithm performance is shown in Fig. 16. This figure displays throughput
rate, (1/TBO), as a function of the number of functional units R. The peak
height of each bar indicates the maximum throughput rate which can be
achieved with the indicated number of processors. The bars also indicate
more clearly that operation at any throughput rate less than maximum is
possible for a given number of functional units. This design procedure is
easily applied to much larger algorithm graphs more representative of actual
signal processing and control problems.

IV.0 DIAGNOSTIC TOOL DEVELOPMENT

IV.1 Analyzer Development
IV.1.1 Introduction

Concurrent processing is the capability of a computer system to execute 
two or more tasks at the same time. For example, a processor may execute a 
given computation at the same time that an I/O coprocessor performs an I/O 
operation. There are new computer architectures that organize processors in a 
parallel fashion requiring customized algorithms to take advantage of the 
parallelism of the systems. However, the models developed to describe these 
architectures do not adequately model the issues of scheduling, coordination, 
and communication (Ref. 17). On the other hand, the strategy proposed by 
Stoughton and Mielke (Ref. 17-19) addresses these particular issues. The 
strategy uses timed Petri nets (Ref. 20) to model processor behavior for each 

computational node of an algorithm graph. 

Detailed data are needed to evaluate and study the performance of a
concurrent processing system. With such data, quantities such as the degree
of concurrency as a function of time can be investigated, and a
sophisticated evaluation of the concurrent system can be performed. To
achieve this objective, it is indispensable that data, such as when the
processing of a data packet is initiated and when it is terminated, be
available. Performance measures such as TBIO or TBO can be obtained from
global information such as when an input is read by the graph and when its
corresponding output is written. This kind of information can be obtained
from an outside observer which monitors the system. The best information the
system is able to provide is the firing of every transition of every node
during execution. With these data, a more comprehensive study of a
concurrent processing system can be done. Although the system itself is used
to provide the information, this reporting does not affect the performance
of the system due to the relatively low speed communication channels used in
the prototype. Another method to probe the system should be devised if high
speed communication channels are to be used. This chapter describes an
analyzer system that yields the required evaluation. In Subsection IV.1.2, a
prototype system and its communication events are closely examined. What the
Diagnostic Routines do in the Graph Manager and their effect on the overall
performance is contained in Subsection IV.1.3. How information of internal
events is recorded is presented in Subsection IV.1.4. In Subsection IV.1.5,
generalities of the Analyzer program are examined, including what
information is input to it and what is obtained as output data. In
Subsections IV.1.6, IV.1.7 and IV.1.8, how the Analyzer program processes
these data to generate measurement information is presented. These
measurements include TBIO (Time between Input and Output), TBI (Time between
Inputs), TBO (Time between Outputs), concurrency of the computing system and
general average process times. In the last two Subsections IV.1.9 and
IV.1.10, a different tool is presented. This tool integrates the simulation
of the system and the analyzer in one program.
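As a rough indication of how such measurements follow from recorded times,
the Python sketch below derives TBIO, TBI, and TBO from two lists of time
stamps. It is not the Analyzer program described in this chapter; the
function name, the list-based data layout, and the sample times are
hypothetical.

```python
def measures(input_times, output_times):
    """Derive TBIO, TBI and TBO from per-task time stamps.

    input_times[k]  : time at which input k is read by the graph
    output_times[k] : time at which the corresponding output is written
    """
    tbio = [out - inp for inp, out in zip(input_times, output_times)]
    tbi = [b - a for a, b in zip(input_times, input_times[1:])]
    tbo = [b - a for a, b in zip(output_times, output_times[1:])]
    return tbio, tbi, tbo

# Hypothetical steady-state record: an input every 3 time units,
# each output written 7 time units after its input.
inputs = [0, 3, 6, 9]
outputs = [7, 10, 13, 16]
tbio, tbi, tbo = measures(inputs, outputs)
print(tbio, tbi, tbo)
```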

IV.1.2 Prototype and its Communication Events

A prototype of a concurrent processing system was developed. It was 
used to prove some of the theories of the graph representation of such systems 
and to establish a basis for comparison of the simulation program to its 
hardware counterpart. The block diagram of the prototype, which was origi- 
nally presented in (Ref. 17), is shown in Fig. 17. 

The system consists of several S-100 units using Intel 8088
microprocessors. Each unit has I/O boards to communicate with the external
world as well as 32k of random access memory (RAM). For test purposes, there
are three units acting as processing elements or Functional Units, one as
the Graph Manager and one that serves as Global Memory. The communication
between them is made through serial ports (Standard RS-232). An IBM Personal
Computer XT (IBM-PC/XT) is used to communicate with the Graph Manager. A
communication channel can be set through the Graph Manager to the Functional
Units and the Global Memory.

The Graph Manager is designed to monitor the graph execution and is 
itself controlled by the data flow in the system. The Graph Manager keeps a 
record of the places in the graph as well as which functional unit is per- 
forming which process node. The Graph Manager "schedules" the assignment of a 
functional unit to a process node according to the priority of the nodes, 
functional units available and the process nodes that can be fired. 

A serial communication link is set between the Graph Manager and every 
Functional Unit. A link is also set between the Global Memory and every 
Functional Unit. Serial communication between the IBM-PC/XT and the Graph 
Manager is used for initialization, and for controlling and monitoring of the 

system. 

When a node is found that can be fired, i.e., its input places are full
and its output places are empty (the latter requirement applies to the
single node model only), that node is assigned to a Functional Unit; i.e.,
the node is fired.

To fire a node, a communications protocol is initiated between the Graph
Manager and an available Functional Unit, as shown in Figure 18. This
protocol begins with the code word D for Do; it is followed by a Task Number,
the Input places or labels, and the Output places. This communication event
is called Assign Task. This information, which is given to the Functional
Unit, is taken from the graph data that are in the Graph Manager's memory. In
this step a task or a node is said to be assigned to a Functional Unit.

The next communication between the assigned Functional Unit and the Graph
Manager is the acknowledgement from the Functional Unit that the input
places have been read (Acknowledge Input). The Functional Unit reads the
data from the Global Memory using another protocol before Acknowledge Input
is sent to the Graph Manager.

Processing of the data starts as soon as the input data are acknowledged by
the Functional Unit. When the process is done, the unit informs the Graph
Manager that processing is finished and that it is ready to place the output
data in the Global Memory. The token information of the output places of the
associated node is examined and it is verified that the output places are
empty (the latter event is true only for the triple node model). The code for
Outputs Empty is sent to the Functional Unit that is working on that node.

The data is written to the output places once the Functional Unit has
clearance for writing. The Graph Manager is informed when the output is
written and the Functional Unit is freed; i.e., the Functional Unit is in a
wait state until the next task is assigned to it.
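
The handshake described above implies a fixed order of communication events for each firing of a node. The following is a minimal sketch in Python of a checker for that order. The single-letter codes follow the event letters used in this report; the function name and the list representation of a trace are illustrative assumptions, not part of the Graph Manager software.

```python
# Event codes in the order implied by the protocol description:
# F (Assign Task), I (Acknowledge Input), P (process done),
# S (outputs empty / clearance to write), O (output written).
HANDSHAKE = ["F", "I", "P", "S", "O"]


def is_valid_node_trace(events):
    """Return True if `events` consists of complete F-I-P-S-O cycles,
    optionally followed by the leading part of a cycle still in progress."""
    for index, event in enumerate(events):
        if event != HANDSHAKE[index % len(HANDSHAKE)]:
            return False
    return True


# One complete firing followed by a firing that has reached "process done".
print(is_valid_node_trace(list("FIPSOFIP")))
# "Process done" reported before "Acknowledge Input" violates the order.
print(is_valid_node_trace(list("FPISO")))
```

A checker of this kind could be applied to the events of one node at a time, since events of different nodes interleave freely in a concurrent execution.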

IV.1.3 Graph Manager Diagnostic Routines

The entire concurrent processing system is accessible to the Graph Manager;
therefore, the Graph Manager is the most suitable subsystem to inform
the outside world of what events are taking place in the concurrent system.

In order to keep a proper time record of the different events in the 
graph, an internal real-time clock is started simultaneously with the exe- 
cution of the graph. As each event is recorded, the clock is read to register 
the time at which the event is taking place.

There are five different events that are recorded. These correspond to the
communication events previously mentioned:

1 (F) Firing of a process node and binding to a Functional Unit.
      "Assign Task".

2 (I) Input places read by the Functional Unit (process node).
      "Acknowledge Input".

3 (P) Process done by the Functional Unit (process node).

4 (S) Output places empty. "Enable Outputs".

5 (O) Output places written by the Functional Unit (process node).
      "Acknowledge Output".


It should be noted that after a node and a Functional Unit have been
assigned to each other, they cannot be distinguished from each other. They
become one entity, and this is the only time when either one, the node or the
Functional Unit, is considered active.

Every event is recorded in the following format:

T{clock}N{node number}{event}[functional unit number]

where {event} can be any of the following letters:

F (The node fires),
I (The input data is read),
P (The process is done),
S (The output places are empty), and
O (The output data is written).

The parameter [functional unit number] is only written when {event} is
equal to 'F'. To simplify the reading of the file, commas are inserted
between every letter and number. The output file of the Simulation program
(Ref. 21) does not require such an adjustment or addition since it is already
provided with commas.
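
As an illustration of the record format, the sketch below parses one event record and inserts the commas between letters and numbers. It assumes that a raw record is written without spaces (for example `T1250N3F2`) and that clock, node and functional unit numbers are decimal integers; these details, and the function names, are assumptions made for the example rather than facts taken from the diagnostic routines.

```python
import re

# Assumed raw layout: T<clock>N<node><event>[<functional unit>]
RECORD = re.compile(r"T(\d+)N(\d+)([FIPSO])(\d*)$")


def parse_event(record):
    """Split one raw record into its fields."""
    match = RECORD.match(record.strip())
    if match is None:
        raise ValueError(f"not a FIPSO record: {record!r}")
    clock, node, event, unit = match.groups()
    return {
        "clock": int(clock),
        "node": int(node),
        "event": event,
        # The functional unit number is only present for 'F' events.
        "unit": int(unit) if event == "F" and unit else None,
    }


def add_commas(record):
    """Insert a comma at every letter/number boundary."""
    return re.sub(r"(?<=[A-Z])(?=\d)|(?<=\d)(?=[A-Z])", ",", record)


print(parse_event("T1250N3F2"))
print(add_commas("T1250N3F2"))
```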

Any probe that is installed in a system for testing purposes introduces 
some error in the reading. The probe used here is no exception to the rule 
and, in order to minimize the error, a special interrupt driven routine was 
written. The diagnostic routines use a buffer to write the information of 
every graph event. This buffer is accessed every time the real-time clock is 
incremented and if the serial port to the IBM-PC is ready to send a character,
this routine sends the next character in the buffer. If there are no
characters in the buffer or the serial port is not ready, the routine just
increments the internal clock and exits without further action. To minimize
the time that it would take to write the commas to the output, a
post-processor program was written that inserts the commas in their proper
places. Due to the low
speed communication channels, this scheme is good enough to minimize any delay 
introduced in the system by these Diagnostic Subroutines. 
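
The buffering scheme can be sketched as follows. This is a simulation in Python of the behavior described above, not the interrupt routine itself; the class name, the method names and the `port_ready` flag are hypothetical, and the record layout without separators is an assumption.

```python
from collections import deque


class DiagnosticProbe:
    """Simulates the clock-tick routine that drains the event buffer."""

    def __init__(self):
        self.clock = 0          # internal real-time clock, in ticks
        self.buffer = deque()   # characters waiting to be sent
        self.sent = []          # characters already sent to the host

    def record(self, node, event, unit=None):
        """Queue one event record; writing to memory is the only cost here."""
        text = f"T{self.clock}N{node}{event}"
        if event == "F" and unit is not None:
            text += str(unit)
        self.buffer.extend(text)

    def on_clock_tick(self, port_ready):
        """Called once per clock tick: always advance the clock, and send at
        most one character if the port is ready and the buffer is not empty."""
        self.clock += 1
        if port_ready and self.buffer:
            self.sent.append(self.buffer.popleft())


probe = DiagnosticProbe()
probe.record(node=3, event="F", unit=2)
for ready in (True, False, True):
    probe.on_clock_tick(ready)
print(probe.clock, "".join(probe.sent))
```

Sending at most one character per tick bounds the time spent in the routine, which is the point of the design: the graph execution is never held up waiting for the slow serial channel.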

IV.1.4 Sequential Account for Concurrent Processing

All the events that are reported in the format explained in Subsection
IV.3.3 are captured in a file that becomes what has been called the "ticker
tape". This file contains all the necessary information to analyze the
performance of the system. This file is called the FIPSO file because it
accounts for Firing, Input, Process, OutStat and Output of every node in the
graph. OutStat is the "enable outputs" signal sent by the Graph Manager to
the Functional Unit as shown in Fig. 18. A sample of a FIPSO file is
presented in Fig. 19.

If the time between two different events is desired, the difference
between the clock values of the first and the last has to be computed. If the
number of computers that were working at the same time during a certain
interval is requested, the procedures to obtain this number are much more
complex, but not impossible.

With this kind of information, the encumbering and depositing of tokens
can be monitored, although there is no direct information about these
particular events. Knowing the graph topology, the depositing of tokens is done
when a node writes data to its output places. The tokens are encumbered when
a specific node is fired. Although it is not obvious, any type of event can
be registered with this information. Getting the information can be a complex
job, but with the help of a specialized program this can be done rather
easily.

IV.1.5 Analyzer Program

The Analyzer is a program that reads FIPSO files and obtains different 
data from the execution of the given graph (see Fig. 20). The data is 
processed to obtain such information as TBO and TBIO. 

The file is read and the information is placed in a two-dimensional 
array, which for convenience is also called the FIPSO array. This array has 
fields defined as follows:

            Clock    Node 1    Node 2    Node 3    ...

Event #1     [ ]       [ ]       [ ]       [ ]     ...
Event #2     [ ]       [ ]       [ ]       [ ]     ...
Event #3     [ ]       [ ]       [ ]       [ ]     ...


The clock field contains the value of the clock at the time of the event.
The node field contains a code that indicates the event the node is in and, if
any, which functional unit is working on it.
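
One possible way to build such an array from a list of events is sketched below. The tuple encoding of the node code, `(event, functional unit)` with `None` for an idle node, is an assumption chosen for the example; the actual code used by the Analyzer is not given in this report.

```python
def build_fipso_array(events, node_count):
    """Build one row per event: [clock, code for node 1, code for node 2, ...].

    `events` is a time-ordered list of (clock, node, event, unit) tuples,
    where `unit` is only meaningful for 'F' events.
    """
    state = [None] * node_count
    rows = []
    for clock, node, event, unit in events:
        # A node whose output was written in an earlier row is idle again.
        state = [None if s is not None and s[0] == "O" else s for s in state]
        if event == "F":
            state[node - 1] = ("F", unit)
        else:
            # Keep the functional unit that was bound when the node fired.
            bound_unit = state[node - 1][1]
            state[node - 1] = (event, bound_unit)
        rows.append([clock] + list(state))
    return rows


events = [
    (0, 1, "F", 1), (140, 1, "I", None), (500, 2, "F", 2),
    (600, 1, "P", None), (610, 1, "S", None), (700, 1, "O", None),
    (800, 2, "I", None),
]
for row in build_fipso_array(events, node_count=2):
    print(row)
```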

The primary display of this program shows the activity of every node in 
the graph (see Fig. 21). The display is actually several plots aligned in 
time, i.e., all of them sharing the same time axis. In this way the activity 
of every node can be compared with the rest. For example, it can be deter- 
mined if several nodes were active at the same time. Another display shows 
the activity of every functional unit instead of the nodes (see Fig. 22). 

Among other data, the concurrency of the system can be extracted at any inter- 
val in time or for the entire graph execution. In this manner, there is a 
display of the concurrency as a function of time. Other data are obtained and 
are explained in detail in the following sections. 

IV.1.6 Measurement of TBIO, TBO, TBI

To measure TBIO, TBO, and TBI of the system, it is necessary to know
which are the input and output nodes of the system. Since this cannot be
reliably extracted from the obtained information, these are parameters that
have to be supplied beforehand to calculate the desired data. After the
program determines which nodes are the input and the output of the system, it
proceeds to search the matrix for occurrences of

1) When input data is read by the input node, and

2) When output data is written by the output node.

These times are recorded in another matrix for further use. Every time an
output is written by the output node, the time from its corresponding input is
calculated and stored in the same array.

After every output has been recorded, TBI and TBO are calculated. For
TBI, this is done starting from the last input entry and going down to the
second input entry, substituting the ith entry by the difference of the ith
entry and the (i-1)st entry. Calculation of TBO is done similarly, except that
the output data is used instead of the input data. This output difference
calculation may be expressed by

$$tO_i = tO_i - tO_{i-1} \quad \text{for } i = n, n-1, n-2, \ldots, 2$$

where n is the number of outputs. The input difference calculation is
similarly performed by

$$tI_i = tI_i - tI_{i-1} \quad \text{for } i = n, n-1, n-2, \ldots, 2$$

where n is the number of inputs.
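
The difference calculation can be sketched in a few lines of Python. The sample times are invented for illustration, and pairing the kth output with the kth input is an assumption about what "corresponding input" means.

```python
def successive_differences(times):
    """Replace entry i by (entry i - entry i-1), working from the last entry
    down to the second so that every subtraction uses original values."""
    diffs = list(times)
    for i in range(len(diffs) - 1, 0, -1):
        diffs[i] = diffs[i] - diffs[i - 1]
    return diffs[1:]  # the first entry has no predecessor


def input_to_output_times(input_times, output_times):
    """TBIO per data packet, pairing the kth output with the kth input."""
    return [out - inp for inp, out in zip(input_times, output_times)]


# Hypothetical clock readings for four data packets.
input_times = [0, 1000, 2000, 3000]
output_times = [1600, 2600, 3600, 4700]

tbi = successive_differences(input_times)
tbo = successive_differences(output_times)
tbio = input_to_output_times(input_times, output_times)
print(tbi, tbo, tbio)
```

Working downward from the last entry is what allows the calculation to be done in place, as the text describes, without a second array.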

The display yields such information as when the system reached steady
state (see Figure 23). When TBI, TBO, and TBIO do not change from one data
packet to the next, the system is said to be in steady state.

IV.1.7 Concurrency Measurement


Concurrency is the property associated with the capability of a computing 
system of executing two or more tasks at the same time. The concurrency 
function or what has lately been called the "Resource Utilization Envelope" 
can be measured or displayed in a rather simple fashion. 

To obtain the concurrency information, the FIPSO array is swept in its
two dimensions. The array is swept along the "event" rows and along the
clock and node columns (see Subsection 3.4). At every row in the array,
every node is checked for activity and the sum of all active nodes is obtained
for that time or row. This is done for every row in the array and the
function of number of resources vs. time is plotted on the screen. This is
the Concurrency Display (see Fig. 24).

There is a value that is also obtained. It is called Computing Power
(CP). This value is equal to the area under the curve of the Concurrency
Display or the "Resource Utilization Envelope". The units of this figure are
"computer-seconds". The "Resource Utilization" can be obtained by

$$RU = \frac{CP}{n \cdot T_E} \times 100$$

where RU is Resource Utilization (%), CP is Computing Power
(computer-seconds), n is the number of resources (computers) and $T_E$ is
Execution Time (seconds). These two quantities can be obtained for the entire
execution or for a portion of it. The interval over which the evaluation is
made is defined by the user.

A table showing percentages of numbers of resources concurrently used
with respect to the execution time is displayed. Thus the maximum possible
concurrency and its percentage with respect to the execution time can be
determined.
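
The sweep and the two derived quantities can be sketched as follows. The row layout, with `None` for an idle node and an `(event, unit)` tuple otherwise, is an assumption made for the example, as is the choice to treat a node as no longer active once its output has been written.

```python
def concurrency_profile(rows):
    """For every row [clock, node codes...], count the active nodes."""
    profile = []
    for row in rows:
        clock, states = row[0], row[1:]
        active = sum(1 for s in states if s is not None and s[0] != "O")
        profile.append((clock, active))
    return profile


def computing_power(profile):
    """Area under the step curve of active resources versus time."""
    area = 0
    for (t0, active), (t1, _) in zip(profile, profile[1:]):
        area += active * (t1 - t0)
    return area


def resource_utilization(cp, resources, execution_time):
    """RU in percent: computing power over the available computer-time."""
    return 100.0 * cp / (resources * execution_time)


# Hypothetical four-row array for a graph with two nodes.
rows = [
    [0, ("F", 1), None],
    [100, ("I", 1), ("F", 2)],
    [300, ("O", 1), ("P", 2)],
    [400, None, ("O", 2)],
]
profile = concurrency_profile(rows)
cp = computing_power(profile)
ru = resource_utilization(cp, resources=2, execution_time=400)
print(profile, cp, ru)
```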

IV.1.8 General Statistics

The different transition times have an exact value in the original
simulation program (Ref. 21). However, in a hardware implementation there are
some variations in these transition times. For example, a memory reading may
take a longer or shorter time than expected.

There is a menu option that allows the user to get the average transition
time for any node. The only parameter supplied is the node number. The
program will scan the FIPSO array and calculate the average time to read the
input data, process the data, wait for output place clearance and write the
output data for the node indicated.

In a hardware implementation of this concurrent system, the different
computers that serve as resources or functional units may have different main
clocks, or can be totally different computers, and of course have some
differences in the time that they would take to either read, process or write
data. This menu option provides a way to obtain average time values of the
activities in the system for any given node.
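
A sketch of the averaging is given below. It assumes that the four transition times of one firing are the intervals F to I (read), I to P (process), P to S (wait for output clearance) and S to O (write); this mapping is inferred from the event descriptions and is not stated explicitly in the report. The event list and names are hypothetical.

```python
PHASES = {
    "read": ("F", "I"),
    "process": ("I", "P"),
    "wait": ("P", "S"),
    "write": ("S", "O"),
}


def average_transition_times(events, node):
    """Average the four phase durations over all complete firings of `node`.

    `events` is a time-ordered list of (clock, node, event) tuples.
    Returns None if the node never completed a firing.
    """
    totals = {name: 0 for name in PHASES}
    firings = 0
    last = {}
    for clock, event_node, event in events:
        if event_node != node:
            continue
        last[event] = clock
        if event == "O":  # a firing is complete once the output is written
            for name, (start, end) in PHASES.items():
                totals[name] += last[end] - last[start]
            firings += 1
            last = {}
    if firings == 0:
        return None
    return {name: total / firings for name, total in totals.items()}


events = [
    (0, 1, "F"), (140, 1, "I"), (300, 2, "F"), (900, 1, "P"),
    (905, 1, "S"), (1000, 1, "O"), (1100, 1, "F"), (1250, 1, "I"),
    (2000, 1, "P"), (2010, 1, "S"), (2100, 1, "O"),
]
averages = average_transition_times(events, node=1)
print(averages)
```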


IV.1.9 Graph Simulation/Analyzer

The Analyzer program is an invaluable tool for the analysis of the FIPSO
file of a single simulation. If the need for exploring the effect of
parameter variation arises, additional program support is needed. This program
is called the Graph Simulation/Analyzer program. It controls the simulation
and, immediately after execution, analyzes its data to obtain the
desired result or reading. Sometimes only a certain number of values are
required to be analyzed, and then this specialized program is ideal for
automated or batch simulation and execution analysis. An overview of its
features is presented in this subsection.

The Graph Simulation/Analyzer program contains basically the same
simulation kernel as the original Simulation program (Ref. 21). It has been
modified to allow the use of random variables as transition times.

The original Simulation program is not only a simulator but also a graph
creator, i.e., the graph need not be defined when the program is called, but
can be defined by the use of graphics commands. The Graph Simulation/Analyzer
needs to be supplied with a graph description and simulation control (GDSC)
file (see Fig. 25). This is a text file that can be created with any pure
ASCII word processor, and the command syntax can be found in the manual of the
program in the appendix of this thesis.

The main purpose of this new program is to "schedule" a series of simu- 
lations of a graph, change parameters, and collect specific output data such 
as ATBIO (Average TBIO in steady state) or the usual FIPSO files. One of the 
advantages over the former simulation program is that most of the program 
setup can be in the GDSC file or, in short, the graph file. In this way, 
setting up a simulation can be as quick as loading the graph file and typing a 
few keystrokes. One of the disadvantages is that the execution of the graph 
cannot be seen graphically. The only parameters that can be observed are the 
clock and the number of outputs from the graph. Even the clock can be 
suppressed from updating to reduce screen update overload. A notable
difference with respect to the former simulation is the capability of adding
random variables to the different transition times in a graph. The range of
variation is specified by the user in the graph file.

The new program is suitable for integration into a design tool for the
concurrent processing systems under study. The automatic control of the
simulation routine makes the program ideal for finding, through iterations,
optimum performance parameters for a given graph.

The program provides on-line context-sensitive help. At every stage 
where user intervention is expected, the F1 key can be pressed and a window
appears providing specific explanation of what the user may do at this part of 
the program. The help window information can be as simple as the statement of 
the purpose of the menu option or examples to illustrate the possible choices. 

This program is expected to be changed in the future and to undergo a
series of enhancements. This is the reason it was written in the C language, a
flexible and simple, yet powerful and easy-to-maintain language. The program
can be easily expanded or modified to meet the future demands of the ongoing
research.

IV.1.10 Output of the Graph Simulation/Analyzer

The Graph Simulation/Analyzer program generates only four kinds of files.
These are Average Time Between Input and Output (ATBIO), Average Time Between
Inputs (ATBI), Average Time Between Outputs (ATBO) and the FIPSO files. The
"average" files collect data that is calculated once the system has reached
the steady state. The computation of the steady state values is done by the
use of a running average, in the following manner:

1- An average is computed for the first six outputs (TBIO, TBI or TBO) and
stored in an average array.

2- The first of those outputs is then discarded and the seventh output is
taken to form the next six outputs.

3- Another average is computed for the new six outputs and is stored in
the next location of the average array.

4- This procedure is applied until there are no more outputs left to work
with.

5- The next step is to find which of the computed averages is within
+/- 1% of its predecessor.

6- An overall average is calculated beginning with this predecessor up to
the last average and this is the ATBIO, ATBI or ATBO.
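
The six steps above can be sketched as a short Python function. The window of six outputs, the 1% tolerance, the minimum of nine outputs and the "N/SS" and "N/EO" markers are taken from this subsection; the function name, the choice of the first qualifying average in step 5, and the use of strings as return values for the two failure cases are assumptions made for the example.

```python
def steady_state_average(values, window=6, tolerance=0.01, minimum=9):
    """Running-average estimate of a steady state value (ATBIO, ATBI or ATBO).

    Returns the overall average, "N/EO" when there are not enough outputs,
    or "N/SS" when no running average is within tolerance of its predecessor.
    """
    if len(values) < minimum:
        return "N/EO"
    # Steps 1 to 4: averages over a sliding window of six outputs.
    averages = [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
    # Step 5: first average within +/- 1% of its predecessor.
    for k in range(1, len(averages)):
        if abs(averages[k] - averages[k - 1]) <= tolerance * abs(averages[k - 1]):
            # Step 6: overall average from that predecessor to the last one.
            tail = averages[k - 1:]
            return sum(tail) / len(tail)
    return "N/SS"


# Hypothetical TBO values: a start-up transient followed by steady state.
tbo_values = [1900, 1400, 1100] + [1000] * 12
print(steady_state_average(tbo_values))
```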

The FIPSO files are obtained the usual way: every event is recorded, every
event code is translated to text, and the FIPSO file is created. The contents
of this file can be examined in the Analyzer as explained in the last
sections.

There are some instances when, although the steady state has been
reached, the program will print "N/SS" (Non-Steady State) instead of the
value sought. This usually occurs because the running average has too few
outputs to work with and the reaching of steady state is hidden in one of the
averages, i.e., the +/- 1% is too restrictive to detect it. Another error
message that can be given is "N/EO," meaning "Not Enough Outputs." The
reason for this message is that there are fewer than nine outputs to work
with, which makes it difficult to calculate the average.

The method of running averages is adequate to find when the graph reaches 
steady state. However, it requires many graph outputs which may create a 
great time burden in terms of simulation time. These computation factors
depend on the number of nodes of the graph, execution time and number of
resources available.

V.0 EXPERIMENTAL RESULTS

V.1 Introduction

This Subsection presents the use of the Analyzer and the Graph
Simulation/Analyzer programs to evaluate the performance of two different
graphs. In Subsection IV.4.1, a graph with parallel paths is investigated.
TBOLB and TBIOLB are calculated and a simulation of the system is performed.
Analysis of the output data is used to obtain the minimum number of resources
necessary to obtain maximum performance regardless of priority assignment.
Subsection IV.4.2 is dedicated to investigating a graph with iterative loops.
The same data are obtained as in Subsection IV.4.1. Subsection IV.4.3 presents
two performance factors based upon TBOLB and TBIOLB.

V.2 Graphs With Parallel Paths

Graphs with parallel paths are important due to the possibility of high
concurrency in the execution of tasks. Fig. 2S presents an example of a graph
with three parallel paths. This example is used to illustrate the calculation
of TBOLB and TBIOLB.

The first step to calculate TBOLB and TBIOLB is to choose a Node Marked
Graph. The Single-Node model is selected because the resulting CMG is
deadlock free. The second step is to obtain the CMG for the given graph. The
third step is to locate the circuit in the CMG with the longest time to
execute in order to obtain TBOLB (see Fig. 28). In the fourth step, the path
from the input to the output of the graph with the longest time has to be
located. This is shown in Fig. 29. This time is TBIOLB and is equal to 2240
time units. The fifth step is to calculate the data injection rate, which is
controlled by the input source node. The time that has to be associated with
this node is equal to the inverse of the input injection rate. To obtain the
effective input rate to the graph, it is necessary to consider the input read
time of the input node. The source node will fire when a token is placed at
its control edge. This is done when the input read time of the input node is
over. Therefore, the source node write time is equal to


$$\text{Write time} = TBO_{LB} - \text{Input read time (Input Node)}$$

The effective input rate to the graph is

$$IR = \frac{1}{TBO_{LB} - t_{IN1}}$$

where IR is the input rate, and $t_{IN1}$ is the input read time of node 1.
Since TBOLB is 1065 and $t_{IN1}$ is 140, the source node write time is 925.
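
The arithmetic of this step can be checked with a short Python sketch. The function name is illustrative; the values 1065 and 140 are the ones quoted above for this graph.

```python
def source_write_time(tbo_lb, input_read_time):
    """Source node write time: TBO lower bound less the time the input node
    needs to read its input data (all values in simulation time units)."""
    return tbo_lb - input_read_time


write_time = source_write_time(tbo_lb=1065, input_read_time=140)
# Write time plus read time gives back the TBO lower bound, so one data
# packet is injected every TBOLB time units.
injection_period = write_time + 140
print(write_time, injection_period)
```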

V.2.1 Simulation 

The simulation is performed with the calculated data for all possible
numbers of resources. The simulation is executed for one resource, two
resources and so on, up to seven resources. The data is input to the Graph
Simulation/Analyzer by means of a Graph Description and Simulation Control
file. The simulation is stopped when the graph has processed fifteen data
packets. The GDSC file used to simulate the example is presented in Fig. 30.

Average TBO, Average TBIO and the FIPSO files are gathered for every
simulation cycle. Resource Utilization (RU) and maximum number of resources
concurrently used are obtained from these files.

The simulation was also run for another priority assignment. The former 
priority assignment tries to output as many data packets as possible; the 
latter tries to load the graph to its maximum before an output is written. 

The first priority assignment has its highest priorities toward the output of 
the graph, i.e., the closer to the output the higher the priority. In this 
way, the highest priority in the graph is to process and output data. The 
system tries to output data as soon as possible. The second priority assign- 
ment tries to input as much data as it can before data is output. The closer 
a node is to the input of the graph the higher is its priority. 


V.2.2 Analysis of Output Data

The Graph Simulation/Analyzer and Analyzer data are tabulated in Tables 1 
and 2. The computing power is about the same for every case since it is the 
total computing power required for processing fifteen data packets. The
resource utilization decreases as the number of resources increases. The
resource utilization is almost the same for one and two resources. For three
and more resources the resource utilization decreases more drastically for a 
change of one resource. For every resource added to the system the resource 
utilization is reduced by about ten percent. 


TBOLB is closely achieved using more than four resources. The small 
difference is due to the overhead time introduced by the Graph Manager, or the 
Simulation, in the scanning and firing of the nodes of the graph. TBIOLB is
obtained using more than two resources. Again, the difference with respect to
the calculated value is due to the scanning of the graph.




This value of TBOLB was obtained for two different priority assignments. 
The value of TBOLB is not calculated based on priority assignment but on the 
transition times in the circuits of the graph. If it is obtained for a given 
number of resources, it should be maintained regardless of the priority 
assignment for at least the same number of resources. 

The maximum number of resources used concurrently is five. Beyond five
resources, adding resources has no effect except to lower the resource
utilization. This graph can be executed at its optimum performance with five
resources.

V.2.3 Minimum Number of Resources for Maximum Performance 

Two important values are observed in Table 1. These are the minimum
number of resources necessary to obtain TBOLB and the minimum number of
resources necessary to obtain TBIOLB. TBOLB is attained for at least five
resources and TBIOLB is attained for at least three resources. The minimum
number of resources for maximum performance is five since with this number of
resources both TBOLB and TBIOLB are obtained. This minimum number of resources
coincides with the maximum concurrency in the graph. This value has been
obtained, by theoretical means, by the ODU research team and has been called
Rmax.

It is important to test if this minimum number of resources is indepen- 
dent of priority assignment. The simulation of this graph was run for five 
resources and for every possible priority assignment. It turned out that the 
maximum performance was obtained for every priority assignment. This test 
method is not recommended as a common practice since it requires too many
hours of simulation execution. It was done here as an exercise, to test that
this minimum number of resources is independent of priority assignment for
this example.

The minimum number of resources at which the TBIOLB is preserved is not
priority independent. A priority can be found at which, for this number of
resources, TBIO is higher than TBIOLB. Table 3 shows the results for the same
graph with a different priority assignment than the last two. The minimum
number of resources at which TBIOLB is preserved is four instead of three as
in the last two examples of priority assignments.

It should be noted that the first two simulations performed on the graph
did not require more than thirty minutes, making the use of the Graph
Simulation/Analyzer and the Analyzer a viable method to evaluate the
performance of a given algorithm graph.

V.2.4 Graphs with Iterative Loops

Graphs with iterative loops belong to another class of graphs that is 
important to the ongoing research. These kinds of algorithm graphs are found 
primarily in applications such as digital signal processing or control
systems, where data from predecessor cycles are needed for computation of a
present data packet. Figure 31 presents an example of a graph with iterative
loops.

The Single-Node model is also used in this example to model the nodes in
the graph. Figure 32 shows the resulting CMG, using the Single-Node model, of
the graph.

The circuit with the longest time per token in the CMG is located in
either of the iterative loops, nodes 2 and 5, or nodes 3 and 6. Since there
is only one token in the circuit, the value of TBOLB is 960 time units. The
effective write time of the input source is equal to TBOLB less the read time
of the input node. The value of the write time of the source node is 890 time
units.

Following the procedure described in Section 2.8, nodes 5 and 6 are
eliminated to calculate TBIOLB. The value of TBIOLB is equal to the sum of
the times from the input source to the output sink. This value is 1600 time
units.

V.2.5 Simulation 

The simulation is performed with the calculated data for all possible 
numbers of resources. The simulation is executed for one resource, two 
resources and so on, up to six resources. The data is input to the Graph 
Simulation/Analyzer by means of a Graph Description and Simulation Control 
file. The simulation is stopped when the graph has processed fifteen data 
packets. The GDSC file used for this example is presented in Fig. 33. 

Average TBI, Average TBO, Average TBIO and the FIPSO files are gathered
for every simulation cycle. Resource Utilization (RU) and maximum number of 
resources used concurrently are obtained from these files. 

The simulation was run for two priority assignments. This difference in 
priority assignments was explained in Subsection IV.4.1.1.

V.2.6 Analysis of Output Data 

The Graph Simulation/Analyzer and Analyzer data are tabulated in Tables 4
and 5. Rmax is equal to three for this graph with iterative loops. Both TBO
and TBIO degrade for numbers of resources less than Rmax. This is different
from the case of the example of Subsection IV.4.1 in which only TBO degrades
below Rmax (in the mentioned example TBIOLB is also attained for one and two
resources below Rmax). For the first priority assignment TBIOLB is still
obtained for two resources, but for the second it degrades. This behavior
indicates that, for this graph, TBIO is priority dependent below Rmax.

There is a difference of ten or eleven time units between ATBI and ATBO
which is not expected since ATBI and ATBO should be equal for the conditions
of the simulation. There is also an increase in the average of TBIO with
respect to ATBIO for two resources in the first priority assignment. A more
detailed observation of the execution in the Analyzer reveals that the
difference between TBO and TBI is being added to TBIO at every data packet.
Every time a data packet is injected into the graph, it takes ten more time
units to arrive at the output than the preceding data packet. This is the
reason ATBIO is much higher than expected. The reason for the difference
between ATBI and ATBO can be observed in the Analyzer. The critical circuit,
nodes two and five, takes more time than calculated due to the scanning of the
nodes in the graph. This increase is directly applied to TBO, but TBI remains
the same as calculated theoretically.

The source write time was incremented to 900 and the simulation was run
again. The results are as expected: ATBI is 975, ATBO is 975, and ATBIO is
1620 for Rmax.

The increase in the source write time is an experimental adjustment to 
obtain the best possible performance. This yields an expression for a lower 
bound TBO adjusted to compensate for system overhead during the execution: 

$$TBO_{LBA} = TBO_{LB} + E$$

where $TBO_{LBA}$ is the adjusted lower bound for TBO, and E is the adjustment
factor obtained from the simulation of the graph, or in the case of a hardware
system, the one obtained by executing the graph.

It should be noted that this adjustment factor, E, was not necessary for
the example of Subsection IV.4.1. The two graphs of the examples are from two
different classes of graphs. The graph of Subsection IV.4.1 belongs to a
class that has its input circuit directly "coupled" to the critical circuit
(the circuit with the longest time per token in the CMG). Two circuits are
coupled when they have a transition in common. The graph of this section is
from a class that has its input circuit "uncoupled" from the critical
circuit, i.e., they are connected through other circuits in the graph. The
graph of section 4.1 is not as sensitive to variations in the time of the
critical circuit as the graph of this section. Since this subject is not in
the scope of this thesis, there will be no further analysis of these classes.

Without the help of the Simulation and the Analyzer, this adjustment 
could not be made in such a short period of time. These adjustments sometimes 
can be predicted, but the Analyzer is a required tool to discover these real- 
ization differences in performance. 

V.3 Performance Factors 

There is a need for an absolute time independent performance factor to 
classify the graphs by their performance. The absolute time in a given graph 
is not as critical as the relative amount of time each node has with respect 
to each other. If each and every transition time in any of the graphs
evaluated in this chapter is multiplied by a constant, the resultant graph has
the same critical circuit as the former graph. The difference is in the
absolute value of the computations. If the appropriate injection rate is
applied at the input, the same resource utilization is obtained.

The TBO performance factor (PFTBO) is obtained by

$$PF_{TBO} = \frac{TBO_{LB}}{TBO_m}$$

where $TBO_m$ is the measured TBO of the system.

The TBIO performance factor (PFTBIO) is obtained by

$$PF_{TBIO} = \frac{TBIO_{LB}}{TBIO_m}$$

where $TBIO_m$ is the measured TBIO of the system.
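
As a worked example, the sketch below evaluates both factors for the graph with iterative loops, using the lower bounds (960 and 1600 time units) and the measured averages at Rmax after the source write time adjustment (ATBO of 975 and ATBIO of 1620) quoted earlier in this chapter. Using the steady state averages as the measured values is an assumption of the example, as is the function name.

```python
def performance_factor(lower_bound, measured):
    """Ratio of the theoretical lower bound to the measured value.

    Because the measured time cannot be smaller than the lower bound,
    the factor is dimensionless and does not exceed one.
    """
    return lower_bound / measured


pf_tbo = performance_factor(lower_bound=960, measured=975)
pf_tbio = performance_factor(lower_bound=1600, measured=1620)
print(round(pf_tbo, 3), round(pf_tbio, 3))
```

Since both numerator and denominator scale with the transition times, multiplying every transition time by a constant leaves the factors unchanged, which is the time independence sought above.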

It should be noted that the maximum possible value of these factors is
1.0. The values of the performance factors for the graphs of Sections 4.1 and
4.2 are presented in Tables 6 and 7, respectively.

VI.0 FURTHER RESEARCH

During the grant period, the ATAMM model was used as the basis for
analytically determining bounds for task computational time and system
throughput.
An operating strategy which achieves optimum time performance was developed. 
In addition, a new diagnostic tool was developed with which to evaluate per- 
formance and functional unit behavior. The diagnostic tool provided moni- 
toring of detailed system operations and the displaying of global system 
performance indicators and measures. 

Continuation of the present effort includes the development of a new 
multicomputer test bed. The operating system and communication processes are 
to obey the ATAMM model and to exhibit a completely distributed graph manager 
operating system. The operating system is to allow for continuously assigned 
functional units. This system is to be composed of personal computers com- 
municating over a local area network. 

The ongoing research has established ATAMM as a viable basis for the 
specification of data flow multicomputer systems. Further research should 
proceed in several directions. An outline of these activities is presented 
below. 

1. Fault Tolerance. Because the ATAMM model inherently allows continuous
assignment of the functional units, an ATAMM defined multicomputer
system is soft-failing with respect to hardware failure. That is, the
algorithm may be expected to continue executing, though with degraded
performance, when functional resources are eliminated. However,
additional effort needs to be directed towards recovery strategies
associated with errors in the data. One applicable method is triple
modular redundancy (TMR), which involves the triplication of the node
processes and majority voting. The TMR strategy needs to be
investigated with respect to both performance and the preservation of
the ATAMM model.

2. An important part of the ATAMM research program is to enhance the
understanding of the relationship between performance measures such
as TBIO, TBO, and TT with respect to the algorithm graph
characteristics and the availability of functional resources. On the
basis of recent observations, research is to be directed toward the
improvement of the performance measures as a result of modifying the
algorithm graph by the addition of nonexecutable features such as
control edges and "dummy" nodes. Present investigations suggest
that these graph augmentations may alter and improve certain aspects
of performance without changing the underlying algorithm.

3. Overhead. Research should be continued toward the refinement of the
node marked graph (NMG) representation. This refinement should
better model the time associated with communication overhead and
other system overhead in relation to node process time. A goal of
this modeling would be to determine limits on algorithm
decomposition in view of graph complexity and increased
communication overhead.

4. Advanced Hardware. An appropriate step in the ATAMM development is
the infusion of the processing rules into advanced technology
multicomputer hardware for avionic or space-borne applications. An
appropriate environment would include VHSIC technology such as the
MIL-STD-1750A processor as the processing element.

5. Theoretical Advancements. So far the ATAMM model has been used
under somewhat restricted conditions. Further research should
include such issues as multiple graphs, nonhomogeneous functional
units, reliability, fault recovery strategies, and system
architecture which takes advantage of the ATAMM model.
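
The TMR approach mentioned under Fault Tolerance can be illustrated with a minimal majority voter. This is a sketch under stated assumptions: the names `tmr_vote` and `run_tmr` are hypothetical, node outputs are assumed to be hashable values that can be compared for equality, and nothing here addresses how the three replicas would be scheduled on functional units under the ATAMM rules.

```python
from collections import Counter


def tmr_vote(results):
    """Return the majority value among three replicated node outputs."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three replicas disagree")
    return value


def run_tmr(process, inputs):
    """Run the same node process three times and vote on the outputs."""
    return tmr_vote([process(inputs) for _ in range(3)])


# One corrupted replica is outvoted by the other two.
print(tmr_vote([5, 5, 7]))
```

The sketch makes the performance question raised above concrete: each node process is executed three times, so the voter trades computing resources for tolerance of a single erroneous result.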
TABLES


GRAPH WITH PARALLEL PATHS
PRIORITY ASSIGNMENT 5 4 [illegible]

RESOURCES  COMPUTING    RESOURCE     AVERAGE      AVERAGE      MAXIMUM
           POWER        UTILIZATION  TBO          TBIO         CONCURRENCY
1          [illegible]  [illegible]  3243.0       3235.0       1
2          [illegible]  [illegible]  1627.5       2582.0       2
3          43403        [illegible]  1207.0       [illegible]  3
4          43856        68.55%       1136.5       2165.0       4
5          43355        57.31%       1083.0       [illegible]  5
6          43355        [illegible]  1083.0       2265.0       [illegible]
7          43355        40.33%       1083.0       2265.0       5


TABLE 1. Results from first experiment, first priority assignment.


GRAPH WITH PARALLEL PATHS
PRIORITY ASSIGNMENT 1 [illegible] 2 7 3 4 5

RESOURCES  COMPUTING    RESOURCE     AVERAGE      AVERAGE      MAXIMUM
           POWER        UTILIZATION  TBO          TBIO         CONCURRENCY
1          43818        [illegible]  3243.0       3235.0       1
2          50068        87.67%       1623.0       [illegible]  [illegible]
3          [illegible]  [illegible]  1294.9       2265.0       [illegible]
4          [illegible]  83.50%       1135.0       2285.0       4
5          50002        57.34%       1083.0       2285.0       5
6          50002        47.78%       1083.0       2265.0       5
7          50002        40.98%       1022.0       2265.0       5


TABLE 2. Results from first experiment, second priority assignment.
GRAPH WITH PARALLEL PATHS
PRIORITY ASSIGNMENT 1 7 2 6 3 4 5

RESOURCES  COMPUTING    RESOURCE     AVERAGE      AVERAGE      MAXIMUM
           POWER        UTILIZATION  TBO          TBIO         CONCURRENCY
1          43813        93.11%       3143.0       5838.0       1
2          [illegible]  97.73%       [illegible]  3243.0       2
3          49511        [illegible]  1324.0       2346.8       3
4          49992        88.58%       1137.0       2273.0       4
5          49999        [illegible]  1083.0       2273.0       5
6          49998        47.79%       1083.0       2273.0       5
7          49359        [illegible]  1083.0       2273.0       5


TABLE 3. Results from first experiment, third priority assignment.
GRAPH WITH ITERATIVE LOOPS
PRIORITY 4 3 2 1 6 5

RESOURCES  COMPUTING    RESOURCE     AVERAGE      AVERAGE  AVERAGE      MAXIMUM
           POWER        UTILIZATION  TBI          TBO      TBIO         CONCURRENCY
1          [illegible]  39.16%       2534.0       2554.0   2569.0       1
2          [illegible]  37.24%       1294.0       1301.0   [illegible]  [illegible]
3          29211        [illegible]  304.0        375.0    1079.0       2
4          39211        [illegible]  994.0        975.0    1679.0       3
5          [illegible]  [illegible]  354.0        975.0    1673.0       3
6          [illegible]  42.80%       [illegible]  975.0    1679.0       3


Table 4. Results from second experiment, first priority assignment.


GRAPH WITH ITERATIVE LOOPS
PRIORITY 5 6 1 2 3 4

RESOURCES  COMPUTING  RESOURCE     AVERAGE  AVERAGE      AVERAGE      MAXIMUM
           POWER      UTILIZATION  TBI      TBO          TBIO         CONCURRENCY
1          40188      38.16%       2594.0   2594.0       [illegible]  [illegible]
2          33319      97.54%       1294.0   1300.3       1987.9       [illegible]
3          39261      85.97%       361.0    971.9        1671.5       3
4          39261      [illegible]  961.0    [illegible]  1671.5       [illegible]
5          29251      [illegible]  961.0    [illegible]  1671.5       [illegible]
6          35251      42.35%       361.0    971.0        1571.5       3


Table 5. Results from second experiment, second priority assignment.
PERFORMANCE FACTORS FOR GRAPH WITH PARALLEL PATHS

Resources  PF_TBO       PF_TBIO
1          0.328399     0.692426
2          [illegible]  0.667544
3          [illegible]  0.988962
4          0.937500     0.988962
5          [illegible]  0.988962
6          0.983379     0.988962
7          [illegible]  0.988962


Table 6. Performance factors for graph of Section 4.1.


PERFORMANCE FACTORS FOR GRAPH WITH ITERATIVE LOOPS

Resources  PF_TBO    PF_TBIO
1          0.370084  0.617999
2          0.737893  0.991325
3          0.984615  0.987654
4          0.984615  0.987654
5          0.984615  0.987654
6          0.984615  0.987654


Table 7. Performance factors for graph of Section 4.2.
FIGURES


Figure 1. Algorithm mar


NMG EDGE LABELS

IF   Input Buffer Full
IE   Input Buffer Empty
DR   Data Read
PC   Process Complete
PR   Process Ready
OE   Output Buffer Empty
OF   Output Buffer Full

Figure 2. ATAMM node marked graph model.

Figure 5. Modified algorithm graph for Figure 1.

Figure 8. Computational marked graph for design example.

Figure 12. Resource envelope overlay diagram with TBO = 3.

Figure 13. Resource envelope overlay diagram with TBO = 3.5.

Figure 14. Resource envelope overlay diagram with TBO = 4.0.

Figure 15. Example algorithm graph performance analysis summary.

Figure 16. Performance margin for example algorithm.












FIPSO file

T,72,N,1,F,1     <-- Node 1 is fired at time 72 and assigned to FU1
T,116,N,1,I      <-- Node 1 read the input places at time 116
T,249,N,1,P      <-- Node 1 finished the process at time 249
T,250,N,1,S      <-- Node 1 got clearance to output data at time 250
T,276,N,1,O      <-- Node 1 wrote the output data at time 276
T,278,N,2,F,2    <-- Node 2 is fired at time 278 and assigned to FU2
T,322,N,2,I      <-- Node 2 read the input places at time 322
T,323,N,1,F,3    <-- Node 1 is fired at time 323 and assigned to FU3
T,367,N,1,I      <-- Node 1 read the input places at time 367
T,455,N,2,P      <-- Node 2 finished the process at time 455
T,456,N,2,S      <-- Node 2 got clearance to output data at time 456
T,482,N,2,O      <-- Node 2 wrote the output data at time 482
T,485,N,3,F,4    <-- Node 3 is fired at time 485 and assigned to FU4
T,500,N,1,P      <-- Node 1 finished the process at time 500
T,501,N,1,S      <-- Node 1 got clearance to output data at time 501
T,527,N,1,O      <-- Node 1 wrote the output data at time 527
T,661,N,3,I      <-- Node 3 read the input places at time 661
T,663,N,2,F,5    <-- Node 2 is fired at time 663 and assigned to FU5
T,707,N,2,I      <-- Node 2 read the input places at time 707
T,708,N,1,F,1    <-- Node 1 is fired at time 708 and assigned to FU1
T,752,N,1,I      <-- Node 1 read the input places at time 752
T,840,N,2,P      <-- Node 2 finished the process at time 840
T,841,N,2,S      <-- Node 2 got clearance to output data at time 841
T,867,N,2,O      <-- Node 2 wrote the output data at time 867
T,885,N,1,P      <-- Node 1 finished the process at time 885
T,886,N,1,S      <-- Node 1 got clearance to output data at time 886
T,912,N,1,O      <-- Node 1 wrote the output data at time 912
T,1190,N,3,P     <-- Node 3 finished the process at time 1190
T,1191,N,3,S     <-- Node 3 got clearance to output data at time 1191
T,1295,N,3,O     <-- Node 3 wrote the output data at time 1295


Figure 19. A sample FIPSO file.
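
The record layout visible in Figure 19 (the letter T, a time, the letter N, a node number, an event code, and for a firing the functional unit number) lends itself to a small parser. The sketch below is an illustration, not the Analyzer's own reader: the names `Event`, `parse_record`, and `describe` are hypothetical, and the meaning of each event code is taken from the annotations in the figure.

```python
from dataclasses import dataclass
from typing import Optional

# Event codes as annotated in the sample file of Figure 19.
EVENT_NAMES = {
    "F": "is fired",
    "I": "read the input places",
    "P": "finished the process",
    "S": "got clearance to output data",
    "O": "wrote the output data",
}


@dataclass
class Event:
    time: int
    node: int
    code: str
    unit: Optional[int] = None  # functional unit, present only for "F"


def parse_record(line: str) -> Event:
    """Parse one record such as 'T,72,N,1,F,1'."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 5 or fields[0] != "T" or fields[2] != "N":
        raise ValueError(f"unrecognized record: {line!r}")
    code = fields[4].upper()
    if code not in EVENT_NAMES:
        raise ValueError(f"unknown event code: {code!r}")
    unit = int(fields[5]) if code == "F" and len(fields) > 5 else None
    return Event(int(fields[1]), int(fields[3]), code, unit)


def describe(event: Event) -> str:
    """Produce a one-line description in the style of the figure."""
    text = f"Node {event.node} {EVENT_NAMES[event.code]} at time {event.time}"
    if event.unit is not None:
        text += f" and assigned to FU{event.unit}"
    return text


sample = ["T,72,N,1,F,1", "T,116,N,1,I", "T,249,N,1,P",
          "T,250,N,1,S", "T,276,N,1,O"]
events = [parse_record(r) for r in sample]
for e in events:
    print(describe(e))

# Time from firing to completion of the write for this activation.
latency = events[-1].time - events[0].time
```

Measures such as the interval between successive outputs could be derived the same way, by differencing the times of successive "O" events at the output node.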
Figure 20. Analyzer information flow.
[Block diagram; legible labels: Concurrent Processing System, Simulation,
Personal Computer, Generation of Data, Comparison for change of parameters
(TBO, TBIO).]






Figure 21. Analyzer Node Activity Display.
[Screen capture; menu items: Assigned FU's, Input/Output, Toggle displays,
Split cursor, Merge cursors, Factor (cursor), Define window, Restore window,
Node Statistics, Concurrency, Quit. Status line: Number of events: 543,
Execution time: 17436.]
Figure 22. Analyzer Functional Unit Display.
[Screen capture of the functional unit activity display. Status line:
Number of events: 543, Execution time: 17436.]
Figure 23. Analyzer Input/Output Display.
[Screen capture listing TBI, TBO, and TBIO for each output. In the legible
rows (2 through 15), TBI and TBO alternate between 1067 and 1099 and TBIO is
2273. Status line: Number of events: 543.]
Figure 24. Analyzer Concurrency Display.
[Screen capture. COMPUTING POWER: [illegible] COMP-SECS. Concurrency
statistics: 4 FUs = 17.7, 3 FUs = 22.7, 2 FUs = 40.9, 1 FU = 11.5,
0 FUs = 7.2. Status line: Number of events: 543, Execution time: 17436.]
Figure 25. Graph Simulation/Analyzer information flow.
[Block diagram; legible labels: Graph Description and Simulation Control
File, For Visual Analyzer.]

Figure 26. Graph with parallel paths.

Figure 27. CMG using Single Node model.

Figure 28. Circuit to obtain TBO


# *** Simulation of a graph with parallel paths ***
# *** It is simulated with full range of resources ***
# *** and with two different priority assignments ***

Graph Graph with parallel paths  TBO(LB) = 1065  TBIO(LB) = 2240
Nodes 7
Sources 1
Sinks 1
Places 10
Resources -1              # *** From 1 to 7 resources ***
Priority 5 4 3 7 2 6 1    # Alternate assignment 1 6 2 7 3 4 5
Tokens 1                  # Data available at the input node
Model Single
Input 1                   # The input node is node 1
Output 5                  # The output node is node 5

Times                     # Global time assignment
Read 70                   # These time assignments are for all
Process 210               # nodes in the graph. They can be
Write 40                  # overridden later on

Node 1
Inputs 1
Outputs 2 7
Time                      # Local time assignment
Read 140                  # These time assignments override
Process 420               # the global time assignments
Write 80                  #

Node 2
Inputs 2
Outputs 3 3

Node 3
Inputs 3
Outputs 4

Node 4
Inputs 4 10
Outputs 5

Figure 30. Graph Description and Simulation Control file
used for the first experiment.
Time                      # Local time assignment
Read 140                  # These time assignments override
Process 420               # the global time assignments
Write 80

Node 5
Inputs 5 3
Outputs 6

Node 6
Inputs [illegible]
Outputs 3
Time                      # Local time assignment
Read 140                  # These time assignments override
Process 420               # the global time assignments
Write 30                  #

Node 7
Inputs 3
Outputs 10

Source 1
Outputs 1
Time
Write 925                 # Source output write time is TBO(LB) - INI
                          # Write time = 1065 - 140

Sink 2
Inputs 5
Time
Read 70

End                       # This ends the Graph Description File


Figure 30. (Continuation).
Figure 31. Graph with iterative loops.

Figure 32. CMG of the graph in Figure 31 using Single Node Model.


# *** Simulation of a graph with iterative loops ***
# *** It is simulated with full range of resources ***
# *** and with two different priority assignments ***

Graph Graph with iterative loops  TBO(LB) = 960  TBIO(LB) = 1600
Nodes 6
Sources 1
Sinks 1
Places 3
Resources -1              # *** From 1 to 6 resources ***
Priority 4 3 2 1 6 5      # Alternate assignment 5 6 1 2 3 4
Tokens 12 9               # Data available at the input node
                          # and at outputs of the iterative loops
Model Single
Input 1                   # The input node is node 1
Output 4                  # The output node is node 4

Times                     # Global time assignment
Read 70                   # These time assignments are for all
Process 210               # nodes in the graph. They can be
Write 40                  # overridden later on.

Node 1
Inputs 1
Outputs [illegible]

Node 2
Inputs 2 7
Outputs 3 [illegible]

Node 3
Inputs 3 3
Outputs 4 3


Figure 33. Graph Description File for the second experiment.
Time                      # Local time assignment
Read 140                  # These time assignments override
Process 420               # the global time assignments
Write 80                  #

Node 4
Inputs 4
Outputs 5

Node 5
Inputs 5
Outputs 7
Time                      # Local time assignment
Read 140                  # These time assignments override
Process 420               # the global time assignments
Write 80                  #

Node 6
Inputs 8
Outputs 3

Source 1
Outputs 1
Time
Write 890                 # Source output write time is TBO(LB) - INI
                          # Write time = 960 - 70

Sink 2
Inputs 5
Time
Read 70

End                       # This ends the Graph Description File


Figure 33. (Continuation).
THE ATAMM PROCEDURE MODEL FOR CONCURRENT PROCESSING OF LARGE
GRAINED CONTROL AND SIGNAL PROCESSING ALGORITHMS


John W. Stoughton and Roland R. Mielke
Department of Electrical and Computer Engineering
Old Dominion University


ABSTRACT

An overview is presented of a model for describing data and control
flow associated with the execution of large grained, decision free
algorithms in a special distributed computer environment. Identified
by the acronym ATAMM, which represents Algorithm To Architecture
Mapping Model, the model provides a basis for relating an algorithm to
its execution in a dataflow multicomputer environment. The ATAMM model
features a marked graph Petri net description of the algorithm behavior
with regard to both data and control flow. The model provides an
analytical basis for calculating performance bounds on input
characteristics which are demonstrated in the paper.

INTRODUCTION

The development of new computer architectures based upon distributed,
multiprocessor organizations [illegible] is motivated mainly by the
requirement for increased speed and greater throughput capability in
complex signal processing applications [illegible]. With the advent of
high-density microelectronics the construction of parallel architectures
consisting of identical, special purpose computing elements is now a
reality [4],[5]. A number of models for describing the behavior of
algorithms in this setting have been developed [6]-[8]. However, these
models represent only the data flow and do not adequately display the
complex issues of communication and control flow which must occur in any
[illegible]. Thus, it has been difficult to investigate how to effectively
match the decomposition and scheduling of [illegible] to the structure and
control of parallel architectures. The importance of better understanding
the relationship between algorithms and architectures is only now becoming
recognized [9].

This paper presents an overview of a graph theoretic model for describing
both data and control flow associated with the concurrent processing of
large grained algorithms in a special distributed computer environment.
This model is identified by the acronym ATAMM which represents Algorithm
To Architecture Mapping Model.

The ATAMM model is important for three reasons. First, the model provides
a hardware independent context in which to investigate the relative merits
of different algorithm decomposition and implementation strategies.
Second, the model defines the data flow and control flow which must be
manifested by any dataflow computer architecture implementing the
decomposed algorithm. Third, the model provides an analytical basis for
performance evaluation.


The problem domain of the ATAMM model consists of large-grained,
decision-free algorithms with computationally complex primitive operations
which are assumed to be implemented in a dedicated distributed dataflow
environment. The algorithms are such as may be found in (but not limited
to) large scale signal processing and control applications. A potential
multicomputer environment might consist of two to twenty processing
elements composed of VHSIC technology.

ATAMM MODEL DEVELOPMENT 

The composition of the algorithms of interest may be such that two or more
operations can be performed concurrently. Thus, the potential exists for
decreasing the computational time required to execute the algorithm by
increasing the computational resources which process the large grained
primitive operations.

The hardware environment (Figure 1) for executing the decomposed
algorithms is assumed to consist of R identical processors or functional
units (FUNs) where R has a value in the range of two to twenty. This range
of resources is suggested for practical reasons due to the large-grained
aspect of the algorithm decomposition and the need to maintain small
communication times relative to process times. Each FUN is a processor
having local memory for program storage and temporary input and output
data containers. Each FUN can execute any algorithm primitive operation.
The FUNs share a common global memory (GLM) which may be either
centralized or distributed. The coordination of FUNs in relation to data
and control flow is directed by the graph manager (GRM). The GRM also may
be centralized or distributed. Transaction rules provide that data created
by the completion of a primitive operation is placed into global memory
only after the output data containers have been emptied. That is, outputs
must be consumed as inputs to successor primitive operations before
allowing new data to fill the output locations. Assignment of a functional
unit to a specific algorithm primitive operation is made by the GRM only
when all inputs required by the operation are available in global memory
and a functional unit is available.
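
The two transaction rules in the preceding paragraph can be written as simple predicates. This is a minimal sketch under assumed data structures: a node is a dictionary with `inputs` and `outputs` edge names, `data_available` and `container_empty` are dictionaries keyed by edge name, and free functional units are kept in a queue. None of these names come from the report.

```python
from collections import deque


def can_assign(node, data_available, free_units):
    """Assignment rule: every input is in global memory and a FUN is free."""
    inputs_ready = all(data_available[e] for e in node["inputs"])
    return inputs_ready and len(free_units) > 0


def can_write(node, container_empty):
    """Write rule: every output data container has been emptied."""
    return all(container_empty[e] for e in node["outputs"])


def assign(node, data_available, free_units):
    """Return the functional unit assigned to the node, or None."""
    if not can_assign(node, data_available, free_units):
        return None
    return free_units.popleft()


# Hypothetical node with two inputs and one output.
node = {"inputs": ["a", "b"], "outputs": ["c"]}
units = deque([1, 2])
print(assign(node, {"a": True, "b": True}, units))   # a unit is assigned
print(assign(node, {"a": True, "b": False}, units))  # None: input missing
print(can_write(node, {"c": False}))                 # False: output not consumed
```

Keeping the free units in a queue mirrors the idea that any unit can execute any primitive operation, so the manager only needs to know whether one is available.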

The algorithm to be executed has its data flow represented in a directed
graph termed the algorithm directed graph (ADG). The ADG provides a
description of the operand data flow and operation sequence required by
the algorithm decomposition. Vertices of the ADG are in a one-to-one
correspondence with each occurrence of a primitive operation. The ADG
contains an edge (i,j) directed from vertex i to vertex j if the output of
primitive operation i is an input operand for primitive operation j. When
constructing an algorithm graph, vertices or nodes (primitive operations)
are displayed as circles, and edges (input-output signals) are displayed
as directed line segments connecting appropriate vertices. Sources and
sinks for input and output signals are represented as squares. Sources
from constants are not usually included in the algorithm graph; however,
[illegible] are used for this purpose when necessary.

To illustrate, consider the computations for the state equation

\[ x(k) = A\,x(k-1) + B\,u(k) \]

and output equation

\[ y(k) = C\,x(k) \]

where x is a p-vector, u is an m-vector, y is an r-vector, and A and B are
constant matrices. The primitive operations are defined as matrix
multiplication and vector addition. The algorithm directed graph for this
algorithm decomposition is shown in Figure 2. Note that each edge is
labeled with the corresponding operands and the nodes are labeled to
indicate the associated computational operation.
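
The decomposition into matrix multiplications and a vector addition can be sketched directly. The code below is illustrative only: it uses plain Python lists rather than any particular numerical library, and the function names `mat_vec`, `vec_add`, and `step`, as well as the sample matrices, are assumptions made for the example.

```python
def mat_vec(m, v):
    """Primitive operation: matrix times vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vec_add(a, b):
    """Primitive operation: vector addition."""
    return [x + y for x, y in zip(a, b)]


def step(A, B, C, x_prev, u):
    """One iteration: x(k) = A x(k-1) + B u(k), then y(k) = C x(k)."""
    ax = mat_vec(A, x_prev)   # can run concurrently with the next line
    bu = mat_vec(B, u)
    x = vec_add(ax, bu)       # needs both products
    y = mat_vec(C, x)         # needs x(k)
    return x, y


# Hypothetical 2-state, 1-input, 1-output system.
A = [[1, 0], [0, 1]]
B = [[1], [2]]
C = [[1, 1]]
x, y = step(A, B, C, x_prev=[1, 1], u=[3])
print(x, y)
```

The two products `ax` and `bu` do not depend on each other, which is the kind of concurrency the algorithm directed graph is meant to expose; the value x(k) fed back as x(k-1) corresponds to the loop edge.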

Petri-nets have been established as an appropriate model for describing
systems defined by some sequence of events. Without argument, the
algorithm directed graph satisfies this general aspect. Further, since
computers need to communicate and be controlled by the occurrence of
certain events, the Petri-net becomes a suitable theoretical vehicle for
the ATAMM model. Certain physical characteristics of the class of problems
under consideration lead to a simplified Petri-net representation. (For a
formal description of Petri-net features, the reader is referred to
references [10-12].)

Considering the data flow in an algorithm directed graph, the execution of
a primitive operation is preconditioned on the availability of input
signals (or operands). This process may be directly modeled by a Petri-net
"transition" which is "enabled" for "firing" when input "places" to the
transition are marked with "tokens". Because the signal or data
availability is a binary condition, it is appropriate that the tokens are
limited to the set {0,1} in order to associate places (conditions) to
transitions (events) in a binary way. A Petri-net having such restricted
input and output functions is called an ordinary Petri-net. The
interpretation of places in the system model developed here is the
availability of a signal. That is, the absence of a token indicates the
absence of a data signal, and the presence of a token indicates the
availability of a data signal. Petri-nets having such restricted markings
are called safe or one-bounded Petri-nets. Finally, the assumption is made
that the algorithms under consideration contain no conflict or decision
making such as "if-then-else" or "do-while" statements, thus limiting
Petri-net places to having one input transition and one output transition.
This class of restricted Petri-nets is called marked graphs. Therefore,
the Petri-nets used in this report are ordinary, safe marked graphs.
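
The enabling and firing rule for an ordinary, safe marked graph can be sketched in a few lines. This is an illustration under stated assumptions: the class name `MarkedGraph` and its methods are hypothetical, each place is stored with exactly one input transition and one output transition, and a marking holds 0 or 1 token per place.

```python
class MarkedGraph:
    """Ordinary, safe marked graph: one producer and one consumer per place."""

    def __init__(self, places, marking=None):
        # places: name -> (input transition, output transition)
        self.places = dict(places)
        marking = marking or {}
        self.marking = {p: 1 if marking.get(p) else 0 for p in self.places}

    def inputs(self, t):
        """Places whose token is consumed when transition t fires."""
        return [p for p, (src, dst) in self.places.items() if dst == t]

    def outputs(self, t):
        """Places that receive a token when transition t fires."""
        return [p for p, (src, dst) in self.places.items() if src == t]

    def enabled(self, t):
        """A transition is enabled when every input place holds a token."""
        return all(self.marking[p] == 1 for p in self.inputs(t))

    def fire(self, t):
        ins, outs = self.inputs(t), self.outputs(t)
        if not self.enabled(t):
            raise RuntimeError(f"transition {t!r} is not enabled")
        if any(self.marking[p] for p in outs if p not in ins):
            raise RuntimeError("firing would put a second token on a place")
        for p in ins:
            self.marking[p] = 0
        for p in outs:
            self.marking[p] = 1


# Two transitions in a loop; the token on "b" is the initial marking.
g = MarkedGraph({"a": ("t1", "t2"), "b": ("t2", "t1")}, {"b": 1})
print(g.enabled("t1"), g.enabled("t2"))
g.fire("t1")
print(g.marking)
```

In this small loop the number of tokens on the circuit stays at one after every firing, which is the behavior expected of a directed circuit in a marked graph.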

Limiting the model to decision-free algorithms is done because the
resulting marked graph models are better understood than general
Petri-nets and hold the potential for the development of performance
bounds for concurrent processing strategies.

An algorithm marked graph (AMG) is a marked graph which represents a
specific algorithm decomposition and is identical in topology to the
corresponding algorithm directed graph. The AMG represents the first
component in the development of the ATAMM model.
The construction rules and symbols are the same as the
ADG except that the edges are marked with tokens to
represent the availability of data. That is, edge (i,j) is
marked with a token if an output from primitive operator
presence of a token on an edge is indicated by a solid dot 
placed on the edge. The vertices correspond to 
transitions which may fire after being enabled by the 
availability of all input data tokens. The decomposed 
state equation represented in Figure 2 is also used to 
illustrate the AMG. It should be noted that the initial 
conditions for the recursion are represented by tokens on 
the loop edges. 

The AMG is a useful tool for representing decom- 
posed algorithms and for displaying data flow within an 
algorithm. However, the AMG does not identify proce- 
dures that a computing structure must manifest in order 
to perform the computing task. 

Algorithm requirements and the computing environment may now be
integrated into a comprehensive Petri-net model to complete the ATAMM
model. The model consists of a Petri-net marked graph called the
computational marked graph (CMG). The CMG displays the data flow and
control flow required to implement a decomposed algorithm in a
multiprocessor data flow computer architecture. Before defining this
model, it is helpful to define an intermediate graph called the node
marked graph (NMG) [13].

A NMG is a Petri-net representation of the 
computing behavior of a FUN for each AMG operation. 
Three primary activities, reading of input data from 
global memory, processing of input data to compute an 
output, and writing of output data to global memory, are 
represented as transitions (vertices) in the NMG. Data 
and control flow paths are represented as places (edges), 
and the presence of signals is notated by tokens marking 
appropriate edges. The conditions for firing the process 
and write transitions of the NMG are as defined for a 
general Petri-net. while the read transition has one 
additional condition for firing. In addition to having a 
token present on each incoming signal edge, a functional 
unit must be available for assignment to the primitive 
operation before the read node can fire. Once assigned, 
the functional unit is used to implement the read, 
process, and write operations before being returned to a 
queue of available FUNs. 

The NMG of interest in this paper requires control signals, indicating
that empty data containers are available to receive new output, as input
edges to the read transition. Therefore, initiation of the node operation
requires not only the availability of input data and a functional unit,
but also the availability of empty output data containers in global
memory. This model is shown in Figure 4.

A computational marked graph (CMG) is constructed from a particular AMG
and the NMG according to the following rules.

1. Source and sink nodes in the AMG are repre- 
sented by source and sink nodes in the CMG. 

2. Nodes corresponding to primitive operations in 
the AMG are represented by NMGs in the CMG. 

3. Edges in the AMG are represented by edge 
pairs, one forward directed for data flow and one 
backward directed for control flow, in the CMG. 

The play of the CMG proceeds according to the following graph rules.

1. A node is enabled when all incoming edges are marked with a token. An
enabled node fires by encum[illegible] when [illegible] node of an NMG is
enabled and [illegible] fires the read node. [illegible] assigned to an NMG
until completion of [illegible]. This logical assignment of the FUN is
managed [illegible] GRM. For illustration, the CMG [illegible] control flow
which must occur in [illegible] provides a [illegible] implementation of
the model [illegible] hardware independent context in which to evaluate
process performance. [illegible] of the algorithm marked graph, the
[illegible] node [illegible] and the data flow [illegible]. A [illegible]
description of the ATAMM model is shown in Figure 5.

ATAMM MODEL GRAPH CHARACTERISTICS

The theoretical analysis of the CMG from the [illegible]. However, several
[illegible] property are noted below for a [illegible] consisting of m
places and n transitions. The m-vector M_k is the marking vector resulting
from the firing of a sequence [illegible]. [illegible] in any directed
circuit of the CMG is invariant under transition firings. [illegible]
marking vector [illegible] transition firing sequence [illegible].

The CMG is said to be consistent. That is, there [illegible] back to M.
[illegible] for marking M if, for all places [illegible] than one token.


PERFORMANCE MEASURES


In this section, performance measures indicating computing speed and
throughput capacity [illegible] the AMG and CMG. This information is
essential for efficiently matching algorithm [illegible] architecture
implementations [illegible] and marked graphs [illegible]. Assume that R
FUNs are available for the algorithm execution. A computational task is
[illegible] a corresponding output data token is deposited at the output
sink.


[illegible] the task is completed. However, [illegible] completion do not
[illegible] iterative signal processing alg[illegible] conditions for the
[illegible] after an output has been [illegible]. Task completion is
usually indicated in the AMG or the CMG by the return of the graph to some
[illegible]. [illegible] facilitate measurement of [illegible] it is
assumed that [illegible] data sets.

[illegible] in the graph. This type [illegible] is referred to as vertical
concurrency and has [illegible] on task computing speed. It [illegible]
number of primitive [illegible] simultaneously in a given algorithm
decomposition and [illegible]. The second type of [illegible]. It is
limited by the capacity of [illegible] additional task inputs, and
[illegible] units available. [illegible] following it is shown that the
[illegible] algorithm decomposition imposes bounds on [illegible]
concurrency and horizontal concurrency possible in a given problem. If
sufficient computing [illegible] available, operation at these bounds
[illegible]. If the number of computing resources [illegible] bounds can
not be [illegible] trade-offs between [illegible].

concurr ^^‘ r * performance measures for concurrent 

PTM-S-Sta* 2SSSw"^TSk™S 

is the computing time whic The performance 

and the corresponding u*k output. between 

measure TT is the computi 8 computation 

a task input and ^J^P^Vrforma^ measure 
associated with that >«k. ^normw ^ 

TOO is the computing time operating 

successive task outpuw wben P P parameters. 

periodically m steady ^tate^ Th* am P^d and 

TDK) and The\hird 

thus reflect of throughput capacity 

S*5r«nS Ike *«« <* “»“ moC5 ' 

wlicn compared I to TT. measures may now be 

-uJss Sg fi^ asa. c « 

token from the data input ^ Similarly, the lower 

the graph tothe dataoutpu . ■ jd l0 com pieie 

bound for TT is the s^rtBitime d ata 

all computing activity mmated by ‘b^njec^^^ ^ 

token from the data input »“«*■. 1 onlv a single 1 


imes 


token from tlie input sour« ■ & sin?le task 

arc the actual perform^ce timw heno^y , ro 

is active in the graph dur g manv computing 

horizontal concurrency), .j h j e (maximum vertical 

resources as are require opcr ating conditions, lower 

bounds for TBIO and TT are calculated by identifying
certain longest paths in a graph obtained from the
algorithm marked graph. This new graph, called the
modified algorithm marked graph (MAMG), is defined
and then used to determine lower bounds for TBIO and TT.

The construction of the modified AMG proceeds
by the following rules. Let $p_i$ be a place of the AMG,
directed from transition $t_r$ to transition $t_s$, which
contains a token of the initial marking. Then the
MAMG may be obtained from the original AMG by

1. Deleting place $p_i$ from the AMG;

2. Adding place $p_{i1}$, directed from the data input
source to transition $t_s$;

3. Adding a new output sink $s_i$, different from all
other output sinks, and a new place $p_{i2}$, directed from
transition $t_r$ to $s_i$; and

4. Repeating 1-3 for each place of the AMG
containing a token of the initial marking.

Let $P_i$ be the ith directed path in the MAMG
from the data input source to the data output sink. The lower
bound for TBIO is defined as

$$\mathrm{TBIO}_{LB} = \max_i \{ T(P_i) \},$$

where the maximum is taken over all paths $P_i$ in the
MAMG and $T(P_i)$ denotes the sum of transition times for
transitions contained in $P_i$.

Let $P_i$ be the ith directed path in the MAMG
from the data input source to any output sink. The lower
bound for TT is defined as

$$\mathrm{TT}_{LB} = \max_i \{ T(P_i) \},$$

where $T(P_i)$ denotes the sum of transition times of
transitions contained in $P_i$, and the maximum is taken
over all paths $P_i$ in the MAMG.

To illustrate, $\mathrm{TBIO}_{LB}$ and $\mathrm{TT}_{LB}$ are computed
for the AMG shown in Figure 2, for which the following
transition times are assumed: T(1)=4, T(2)=1, T(3)=5,
and T(4)=6. The MAMG is shown in Figure 6. It may
be easily shown that $\mathrm{TBIO}_{LB} = 10$ and $\mathrm{TT}_{LB} = 11$.

A lower bound for the performance measure TBO
is now determined from the CMG representing a
decomposed algorithm. It is assumed that operating
conditions are set to maximize horizontal concurrency.
That is, data tokens are continuously available at the
data input source, and as many computing resources as
needed can be called to perform primitive operations.
With these conditions, the graph plays periodically in
steady-state, and $\mathrm{TBO}_{LB}$ is the shortest time possible
between successive outputs. Let $C_i$ be the ith directed
circuit in the CMG. The notation $T(C_i)$ denotes the sum
of transition times of transitions contained in $C_i$, and
$M(C_i)$ denotes the number of tokens contained in $C_i$.
Then,

$$\mathrm{TBO}_{LB} = \max_i \{ T(C_i) / M(C_i) \},$$

where the maximum is taken over all directed circuits in
the CMG.

The CMG in Figure 4 has many directed circuits.
However, the directed circuit which contains all NMG
nodes of transitions 2 and 4 contains only one token and
maximizes the ratio $T(C_i) / M(C_i)$. Therefore, this
circuit determines $\mathrm{TBO}_{LB}$, the shortest time possible
between successive outputs in this graph.

STRATEGY FOR OPTIMUM TIME PERFORMANCE

Of interest is the development of an operating
strategy for the ATAMM model which achieves optimum
time performance with a minimum number of computing
resources. Unfortunately, this problem is equivalent to a
class of scheduling problems which is known to be
NP-complete. Thus, no algorithm is known for
obtaining an optimum solution which is substantially
better than enumerating all possible solutions and then
choosing the best one. However, a suboptimal operating
strategy which achieves optimum time performance, but
possibly requires more than the minimum number of
computing resources, has been developed and is
illustrated in the following.

When presented with continuously available input
data sets, the natural behavior of a data flow architecture
results in operation where new data sets are accepted as
rapidly as the available resources permit. That is, the
architecture naturally operates at high levels of
horizontal concurrency with the possible loss of capability
for achieving high levels of vertical concurrency. This
operation is characterized by high throughput rates,
$\mathrm{TBO} \approx \mathrm{TBO}_{LB}$, but relatively poor task computing
speed so that $\mathrm{TBIO} \gg \mathrm{TBIO}_{LB}$ and $\mathrm{TT} \gg \mathrm{TT}_{LB}$. In
many signal processing and control applications, it is
important to achieve both high throughput rate and high
task computing speed. The suboptimal operating
strategy is characterized by the following performance.

1. When $R \ge R_{Max}$, operation achieves $\mathrm{TBIO}_{LB}$,
$\mathrm{TT}_{LB}$, and $\mathrm{TBO}_{LB}$. $R_{Max}$ is computed in implementing
the strategy, and represents the minimum number of
resources which ensures maximum horizontal concurrency
and maximum vertical concurrency under this strategy.

2. When $R_{Max} > R \ge R_{Min}$, operation achieves
$\mathrm{TBIO}_{LB}$ and $\mathrm{TT}_{LB}$, but $\mathrm{TBO} > \mathrm{TBO}_{LB}$. The strategy
preserves task computing speed or vertical concurrency at
the expense of throughput rate or horizontal concurrency.
$R_{Min}$ is also computed in implementing the strategy, and
represents the minimum number of resources needed to
maintain vertical concurrency with limited horizontal
concurrency.

The rate at which new data is presented to the
CMG must be limited. This is accomplished by adding a
control transition connected in a directed circuit with the
data input source. The control transition imposes a
minimum delay of D time units between inputs. Delay D
is chosen according to the following rule:

$$D = \begin{cases} \mathrm{TBO}_{LB}, & R \ge R_{Max} \\ \mathrm{TBO}_{Min}, & R_{Max} > R \ge R_{Min} \\ \mathrm{TCE}, & R_{Min} > R \ge 1 \end{cases}$$

TCE denotes the total computing effort required to
complete the task, and $\mathrm{TBO}_{Min}$, $R_{Max}$, and $R_{Min}$ are
computed as part of the operating strategy design
procedure.
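
The delay rule can be written as a small selection function. The following is a minimal sketch, not the report's implementation: the function name, the argument names, and the numbers in the usage lines ($R_{Max} = 4$, TCE = 12) are hypothetical and chosen only for illustration; the dictionary `tbo_min` stands for the table of $\mathrm{TBO}_{Min}$ values produced by the design procedure.

```python
def select_input_delay(r, r_max, r_min, tbo_lb, tbo_min, tce):
    """Return the minimum delay D between task inputs for r functional units.

    tbo_min maps each resource count R with r_min <= R < r_max to TBO_Min(R).
    All argument names are assumptions made for this sketch.
    """
    if r < 1:
        raise ValueError("at least one functional unit is required")
    if r >= r_max:
        # Enough units: inputs may arrive at the throughput bound.
        return tbo_lb
    if r >= r_min:
        # Keep TBIO_LB and TT_LB, give up some throughput.
        return tbo_min[r]
    # Fewer than r_min units: space inputs by the total computing effort.
    return tce


# Hypothetical values, for illustration only.
delays = {
    r: select_input_delay(r, r_max=4, r_min=3, tbo_lb=3, tbo_min={3: 4}, tce=12)
    for r in range(1, 6)
}
print(delays)
```

With these illustrative values, one or two units give D = 12, three units give D = 4, and four or more units give D = 3.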
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LH 


is the 


illustrated in Figure 13. The results of these calculations
are $\mathrm{TBO}_{Min}(3) = 4$.
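
The overlay calculation behind Figures 12 and 13 can be mimicked numerically. The sketch below assumes integer time units and a single-task resource profile, that is, the number of functional units busy in each unit interval when one task plays alone. The profile values and all function names are hypothetical and are not taken from Example 2. Copies of the profile are overlaid at a spacing of TBO time units, the peak of the overlay is the resource requirement, and the spacing is increased until the peak no longer exceeds R.

```python
def overlay_peak(profile, tbo):
    """Peak number of busy units in steady state when a task starts every tbo units.

    profile[k] is the number of units busy during [k, k+1) for one task alone.
    """
    period = [0] * tbo
    for k, busy in enumerate(profile):
        period[k % tbo] += busy  # fold the profile onto one operating period
    return max(period)


def tbo_min(profile, r, tbo_lb):
    """Smallest input spacing >= tbo_lb whose overlay peak does not exceed r."""
    if r < max(profile):
        raise ValueError("r is below the single-task peak; no spacing can help")
    tbo = tbo_lb
    while overlay_peak(profile, tbo) > r:
        tbo += 1
    return tbo


# Hypothetical single-task profile, for illustration only.
profile = [1, 2, 3, 2, 1, 1]
r_max = overlay_peak(profile, 3)   # peak of the overlay at TBO_LB = 3
r_min = max(profile)               # peak when tasks do not overlap
print(r_max, r_min, tbo_min(profile, 3, 3))
```

The loop terminates because once the spacing reaches the length of the profile, tasks no longer overlap and the peak equals the single-task peak.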

The performance degradation « a foncuon i of R of 

improvement in thruput is available for K>ii Mjtx 
CONCLUSION

S3S. 

computer environmml. The A calculating 

shown to provide an chlracteristics and 

performance boun^ on ihrup ch AMM model 

SSS&- - 

becomes the basis of design for these structures. 
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Sion V Compute R^ ljtx . K\Un’ TD0 Mii^ * US1 ^ 
the resource utilization envelope. is deterrrun 

shown in Figure 12 for the example. R Max * c,,ua ° 
tlie largest resource requirement durine any time inie « 
wnhin the straiiv state operating period. R Min » 
mimmum number of = «, necessary yjj 
maximum vertical concumm - , | l0 maximum 

concurrency. This nutnlnr . . ufcc ,„ilization 

resource requirement indicate * p example problem. 

T C{ % mu| a |C n =3 The value of Tn<» Min "a. h 
resource number R between H MrtX Anil R N1l| , '"elusive, is 

determined by increasing tlu ' ° maximum 
resource "Ulizanon envelop un ul ^ i|ipul 
resource requirement is R. i L . 

delay to produce this resource requirement. R=3 arc 
example, the calculations of TBO Min
Figure 2. Algorithm Marked Graph - Example 1
Figure 3. ATAMM Node Marked Graph Model
Figure 1. Representative ATAMM Architecture
Figure 4. Computational Marked Graph - Example 1
Figure 6. Modified AMG - Example 1
Figure 7. Algorithm Marked Graph - Example 2



Figure 5. ATAMM Model Components



Figure 8. Computational Marked Graph - Example 2
Figure 9. Graph Play With TBO=3 and Unlimited Functional Units
Figure 10. Resource Utilization Envelope - Example 2



Figure 11. Graph Play With TBO=4 W/O Control Edges
Figure 12. Resource Envelope Overlay Diagram - TBO=3.0
Figure 13. Resource Envelope Overlay Diagram - TBO=4.0

MODELING AND PERFORMANCE BOUNDS FOR CONCURRENT PROCESSING


Roland R. Mielke, John W. Stoughton and Sukhamoy Som
Department of Electrical and Computer Engineering 
Old Dominion University 
Norfolk, Virginia 


ABSTRACT

The development of a new graph theoretic
model for describing the relation between a
decomposed algorithm and its execution in a
multiprocessor environment is presented. Called
ATAMM, the model consists of a set of Petri net
marked graphs which incorporate the general
specifications of a data flow architecture. The model
is useful for representing decision-free algorithms
having large-grained, computationally complex
primitive operations. Performance measures of
computing speed and throughput capacity are defined.
The ATAMM model is used to develop analytically
lower bounds for these quantities.

I. INTRODUCTION

The development of a new graph theoretic model for
describing data and control flow associated with the
execution of large-grained algorithms in a special
distributed computing environment is presented. The
model is identified by the acronym ATAMM which
represents Algorithm To Architecture Mapping Model.
The purpose of such a model is to provide a basis for
establishing rules for relating an algorithm to its execution
in a multiprocessor environment. Specifications derived
from the model lead directly to the description of a data
flow architecture. The availability of the ATAMM model
is important for at least three reasons. First, it provides a
context in which to investigate algorithm decomposition
strategies without the need to specify a specific computer
architecture. Second, the model identifies the data flow
and control dialog required of any data flow architecture
which implements the algorithm. And third, the model
provides a basis for calculating analytically performance
bounds for computing speed and throughput capacity.

The problem domain of the ATAMM model consists 
of decision free algorithms with computationally complex 
primitive operations which are assumed to be implemented 
in a dedicated data flow environment. The algorithms are 
such as may be found in (but not limited to) large scale
signal processing and control applications. The
anticipated multiprocessor environment is assumed to
consist of two to twenty processing elements for concurrent 
execution of the various algorithm primitives. 

The development of new computer architectures
based upon distributed, multiprocessor organizations [1],
[2] is motivated mainly by the requirement for increased
speed and greater throughput capability in complex signal
processing applications [3]. Recent advances in the
production of high-density microelectronics [4] have made
possible the construction of parallel architectures
consisting of identical, special purpose computing elements
[5]. A number of models for describing the behavior of
algorithms in this setting have been developed [6]-[8].
However, these models represent only the data flow and do
not adequately display the complex issues of
communication and control flow which must occur in any
realization of the model. For this reason, it has been
difficult to investigate how to effectively match the
decomposition and scheduling of algorithms to the
structure and control of parallel architectures. The
importance of better understanding the relationship
between algorithms and architectures is only now
becoming recognized [9].

In Section II of the paper, the modeling process to 
describe algorithms in data flow architectures, ATAMM, is 
presented. The model consists of three Petri net marked 
graphs called the algorithm marked graph (AMG), the 
node marked graph (NMG), and the computational 
marked graph (CMG). In Section III, time performance 
measures for concurrent processing are defined. The 
ATAMM model is used as the basis for calculating 
analytically lower bounds for these performance measures. 
An example is presented to illustrate these concepts, and 
the results of experimental runs on actual multiprocessor 
hardware are reported. 

II. ATAMM MODEL DEVELOPMENT

In this section the ATAMM model to describe
concurrent processing of decomposed algorithms is
presented. The model consists of a set of Petri net marked
graphs which incorporate general specifications of
communication and processing associated with each
computational event in a data flow architecture. First, a
detailed description of the problem context is stated. This
is followed by the definition of the ATAMM model
consisting of the algorithm marked graph, the node
marked graph, and the computational marked graph.
Some familiarity with Petri nets [10] and marked graphs
[11] is assumed in this presentation.

The problems of interest are decision-free,
computationally complex problems as are often found in
signal processing and control applications. A problem
description normally results in the definition of a function
given by the triple (X, Y, F). The set X represents the set
of admissible inputs, the set Y represents the set of
admissible outputs, and F: X -> Y is the rule of
correspondence which unambiguously assigns exactly one
element from Y to each element of X. Associated with a
computational problem is one or more algorithms. An
algorithm is an explicit mathematical statement, expressed
as an ordered set of primitive operations, which explains
state equation
y(k) = Cx(k),
process and write transit 0 j transition has one
Figure 2. ATAMM node marked graph model.


A computational marked graph (CMG) is constructed
from the AMG and the NMG by the following rules.

1. Source and sink nodes in the algorithm marked

In order to illustrate the construction of a
computational marked graph, the CMG corresponding to
the algorithm marked graph of Figure 1 is shown in
Figure 3. The computational marked graph is useful
because it displays the data and control flow which
must occur in any hardware or software implementation
of the model, and because it provides a hardware
independent context in which to evaluate process
performance.



) ATAMM i 


***• If Ml 


K&ss "”“*^'0 »< «- ATfflrta,-; 



n*.r« «. ATAMM MM nmmmu. 


III. PERFORMANCE BOUNDS

The importance of the ATAMM model is that it
establishes a context in which to investigate the
performance of decomposed algorithms in multiprocessor
data flow architectures. In this section performance
measures indicating computing speed and throughput
capacity are defined. Bounds for these quantities are
calculated analytically from the algorithm marked graph
and the computational marked graph. This information is


is £ "".resting »?p»“ ll »" fp d '“ST12I 

receivTinvestigations of the performance of Petr, nets (12|, t 
(13] and marked graphs [14]. j 

It is assumed that a decomposed algorithm is 1 

SrsSH;S“ w 

when > wwrfjj : outju, ; tota jJ,^SS all 

rE«E‘2r^,J: 

SAiS £ p£i. & wi« *» i»p»“ 

from previous task calculations. 

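Concurrency in this problem setting occurs in two
ways. First, different functional units may perform
simultaneously several primitive operations belonging to a
single task. This type of concurrency is referred to as
vertical concurrency and has a direct effect on task
computing speed. It is limited by the number of primitive
operations which can be performed simultaneously in a
given algorithm decomposition and by the number of
functional units available. Second, different functional
units may perform simultaneously primitive operations
belonging to different tasks. Called horizontal
concurrency, this type of concurrency has a direct effect
on throughput capacity. It is limited by the capacity of
the graph to accommodate additional task inputs and by
the number of functional units available. In the following,
it is shown that the problem and algorithm decomposition
impose bounds on the vertical concurrency and horizontal
concurrency possible in a given problem. If sufficient
computing resources are available, operation at these
bounds can be achieved. If the number of computing
resources is limited so that the bounds cannot be
achieved, trade-offs between the vertical concurrency and
horizontal concurrency are possible.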

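Three performance measures for concurrent
processing are defined. The first two, TBIO and TT, are
indicators of computing speed and the degree of vertical
concurrency. The third parameter, TBO, is an indicator
of throughput capacity and the degree of horizontal
concurrency.

Definition: TBIO. The computing time which elapses
between a task input and the corresponding task output.

Definition: TT. The computing time which elapses
between a task input and the completion of all computing
activity associated with that task.

Definition: TBO. The computing time which elapses
between successive task outputs when the graph is
operating periodically in steady-state.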

The remainder of this section is devoted to developing
lower bounds for these performance measures.

Let G denote an algorithm marked graph
is defined and then used to determine lower bounds for
TBIO and TT.

Definition: Modified Algorithm Graph. Let $p_i$ be a place
of G, directed from transition $t_r$ to transition $t_s$, which
contains a token of the initial marking. The modified
algorithm graph $G_M$ is obtained from G by the
following construction rule.

1. Place $p_i$ is deleted from G.

2. A new place $p_{i1}$, directed from the data input
source to transition $t_s$, is added to G.

3. A new output sink $s_i$, different from all other
output sinks, and a new place $p_{i2}$, directed from
transition $t_r$ to $s_i$, are added to G.

4. The above rules are repeated for each place of G
containing a token of the initial marking.

Lower bounds for TBIO and TT are presented in
Theorem 1 and Theorem 2 respectively.

where the maximum is taken over all paths $P_i$ in graph
$G_M$.

Proof. Without loss of generality, let $t_f$ be the last
transition in all paths $P_i$ directed from the data input
source to the data output sink. Transition $t_f$ is enabled
when each input place for $t_f$ contains a token. Since by
assumption a computing resource is available, $t_f$ fires as
soon as it becomes enabled. Let $p$ be the last input place
for $t_f$ to acquire a token, and let $t$ be the input transition
for place $p$. Continuing this labeling procedure results in
a backward path construction process. This process is
repeated, first at $t$, and then at each succeeding transition
until the data input source is reached, identifying a path
$P_j$. By the construction process for the path, it is clear
that $T(P_j) = \max_i \{ T(P_i) \}$, where the maximum is over
all paths $P_i$ in $G_M$. It is also clear that $\mathrm{TBIO}_{LB}$ can be
no shorter than $T(P_j)$ so that $\mathrm{TBIO}_{LB} \ge T(P_j)$. Since a
computing resource is available when each transition in $P_j$
is enabled, the time between input and corresponding
output can be no longer than $T(P_j)$ so that
$\mathrm{TBIO}_{LB} \le T(P_j)$. Therefore, $\mathrm{TBIO}_{LB} = T(P_j) = \max_i \{ T(P_i) \}$,
where the maximum is over all paths $P_i$ in $G_M$.
This completes the proof.

Theorem 2: Lower Bound for TT. Let $P_i$ be the ith
directed path in $G_M$ from the data input source to any
output sink, and let $T(P_i)$ denote the sum of transition
times of transitions contained in $P_i$. Then,

$$\mathrm{TT}_{LB} = \max_i \{ T(P_i) \},$$

where the maximum is taken over all paths $P_i$ in graph
$G_M$.

Proof. By the construction rules for graph $G_M$, a task is
initiated when input data tokens are input from the data
input source, and is completed when all output sinks have
accepted tokens. Therefore, TT is the time which elapses
from injection of input tokens to the arrival of a token at
the last fired output sink. Let $T(P_l) = \max_i \{ T(P_i) \}$, $P_i$ in
$G_M$, be the longest path time of paths from the data input
source $s_I$ to any output sink, say $s_l$. Since a token must
reach sink $s_l$ before a task is completed, it follows that
$\mathrm{TT}_{LB} \ge T(P_l)$. Since a resource is available for each
transition to fire when enabled, and since $P_l$ is the longest
path in $G_M$, it also follows that $\mathrm{TT}_{LB} \le T(P_l)$. Therefore,
$\mathrm{TT}_{LB} = T(P_l) = \max_i \{ T(P_i) \}$, where the maximum is
over all paths $P_i$ in $G_M$. This completes the proof.

To illustrate the application of Theorem 1 and
Theorem 2, $\mathrm{TBIO}_{LB}$ and $\mathrm{TT}_{LB}$ are computed for the
algorithm graph shown in Figure 1. For this example, the
following transition times are assumed: T(1) = 4,
T(2) = 1, T(3) = 5, and T(4) = 6. The modified
algorithm graph corresponding to Figure 1 is shown in
Figure 5. The modified algorithm graph contains two
paths directed from the data input source $s_I$ to the data
output sink $s_O$. Path $P_1$ consists of edge set {1, 2, 3, 4}
with $T(P_1) = 10$, and path $P_2$ consists of edge set {5-1, 3,
4} with $T(P_2) = 6$. Therefore, since $T(P_1) > T(P_2)$, path
$P_1$ determines the lower bound for TBIO and
$\mathrm{TBIO}_{LB} = 10$. The modified algorithm graph contains two
additional directed paths from the data input source $s_I$ to
the output sink $s_5$. Path $P_3$ consists of edge set
{1, 2, 6, 5-2} with $T(P_3) = 11$, and path $P_4$ consists of
edge set {5-1, 6, 5-2} with $T(P_4) = 7$. Since
$T(P_3) > T(P_1) > T(P_4) > T(P_2)$, path $P_3$ determines the
lower bound for TT and $\mathrm{TT}_{LB} = 11$.

Figure 5. Modified algorithm graph for Figure 1.
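
The construction rule and the two path bounds can be checked numerically for this example. The sketch below is illustrative only. The place list is an assumption: Figure 1 is not reproduced in the text, so the topology was inferred from the edge sets and path times quoted in the example (place 5, from transition 4 to transition 2, is taken to hold the initial token). The function names and node labels are hypothetical.

```python
TIMES = {1: 4, 2: 1, 3: 5, 4: 6}  # transition times from the example

# Assumed AMG: place name -> (from node, to node, holds an initial token?)
AMG = {
    "1": ("s_I", 1, False),
    "2": (1, 2, False),
    "3": (2, 3, False),
    "4": (3, "s_O", False),
    "5": (4, 2, True),
    "6": (2, 4, False),
}


def build_mamg(places, source="s_I"):
    """Apply the construction rule: split every initially marked place."""
    mamg = {}
    for name, (src, dst, marked) in places.items():
        if not marked:
            mamg[name] = (src, dst)
        else:
            mamg[name + "-1"] = (source, dst)      # new place from the input source
            mamg[name + "-2"] = (src, "s_" + name)  # new place to a new output sink
    return mamg


def longest_path_times(mamg, times, source="s_I"):
    """Map each sink to the largest sum of transition times over paths from source.

    Exhaustive search; assumes the modified graph is acyclic, which holds when
    every directed circuit of the original graph contains an initial token.
    """
    succ = {}
    for src, dst in mamg.values():
        succ.setdefault(src, []).append(dst)
    best = {}

    def walk(node, elapsed):
        if node not in succ:  # no outgoing place: an output sink
            best[node] = max(best.get(node, 0), elapsed)
            return
        for nxt in succ[node]:
            walk(nxt, elapsed + times.get(nxt, 0))  # sinks add no time

    walk(source, 0)
    return best


mamg = build_mamg(AMG)
best = longest_path_times(mamg, TIMES)
tbio_lb = best["s_O"]        # longest path to the data output sink
tt_lb = max(best.values())   # longest path to any output sink
print(tbio_lb, tt_lb)
```

Under the assumed topology the sketch reproduces the values stated in the example, a TBIO bound of 10 and a TT bound of 11.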


Next a lower bound for the performance measure
TBO is presented. Let G be a computational marked
graph representing a decomposed algorithm. It is assumed
that operating conditions for G are set to maximize
horizontal concurrency. That is, data tokens are
continuously available at the data input source, and as
many computing resources as needed can be called to
perform primitive operations. With these conditions, the
graph plays periodically in steady-state, and $\mathrm{TBO}_{LB}$ is
the shortest time possible between successive outputs.

Theorem 3: Lower Bound for TBO. Let G be a
computational marked graph and let $C_i$ be the ith directed
circuit in G. The notation $T(C_i)$ denotes the sum of
transition times of transitions contained in $C_i$, and $M(C_i)$
denotes the number of tokens contained in $C_i$. Then,

$$\mathrm{TBO}_{LB} = \max_i \{ T(C_i) / M(C_i) \},$$

where the maximum is taken over all directed circuits in
G.

Proof. Without loss of generality, let $t_f$ be the output
transition in G so that an output is produced each time $t_f$
completes firing. Then $\mathrm{TBO}_{LB}$ is the minimum firing
period of transition $t_f$. It is shown in [15] (pp. 58-60) that
the minimum firing period of each transition of a marked
graph is given by $\max_i \{ T(C_i) / M(C_i) \}$, where the
maximum is taken over all directed circuits $C_i$ in G.
Therefore, the theorem follows.


The computational marked graph shown in Figure 3
is used to illustrate Theorem 3. This CMG contains many
directed circuits. However, the directed circuit which
contains all NMG nodes of transitions 2 and 4 contains
only one token and maximizes the ratio $T(C_i) / M(C_i)$.
Therefore, the shortest time possible between successive
outputs in this graph is $\mathrm{TBO}_{LB} = 7$.
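
The circuit ratio of Theorem 3 can be evaluated by enumerating directed circuits. The following sketch is a simplification, not the full calculation. It works on an assumed transition-level version of the example (Figure 3 is not reproduced in the text), so the additional circuits that the CMG gains through its NMG nodes are not modeled; the text states that none of them exceeds the circuit through transitions 2 and 4. The edge list and the function name are hypothetical.

```python
from fractions import Fraction

TIMES = {1: 4, 2: 1, 3: 5, 4: 6}  # transition times from the example

# Assumed edges between transitions: (from, to, tokens in the connecting place)
EDGES = [(1, 2, 0), (2, 3, 0), (2, 4, 0), (4, 2, 1)]


def tbo_lower_bound(edges, times):
    """Return max T(C)/M(C) over simple directed circuits, or None if there are none."""
    succ = {}
    for src, dst, tokens in edges:
        succ.setdefault(src, []).append((dst, tokens))
    best = None

    def extend(start, node, path, tokens):
        nonlocal best
        for nxt, tok in succ.get(node, []):
            if nxt == start:
                total = tokens + tok
                if total == 0:
                    raise ValueError("token-free circuit: the graph cannot play")
                ratio = Fraction(sum(times[t] for t in path), total)
                best = ratio if best is None else max(best, ratio)
            elif nxt not in path and nxt > start:
                # Visit each circuit once, rooted at its smallest transition.
                extend(start, nxt, path + [nxt], tokens + tok)

    for start in sorted(times):
        extend(start, start, [start], 0)
    return best


tbo_lb = tbo_lower_bound(EDGES, TIMES)
print(tbo_lb)
```

For the assumed edges the only circuit is the one through transitions 2 and 4, with transition time sum 1 + 6 and one token, which agrees with the bound of 7 stated above. Exhaustive enumeration is adequate for graphs of this size; it is not meant as an efficient method for large graphs.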


The optimum time performance for this example 
algorithm is described by the following characteristics. 
The algorithm accepts an input and issues an output every 
7 time units. Each input requires a total of 11 time units
of processing, and an output is issued 10 time units after 
the input is accepted. It can be shown by simulation that 
3 functional units are required to achieve this performance. 
The addition of more functional units will not improve the 
computing speed or throughput rate for this algorithm 
decomposition. 


IV. CONCLUSION 

A new model useful for understanding the 
relationship between decomposed algorithms and data flow 
architectures has been presented. Named ATAMM for 
Algorithm To Architecture Mapping Model, the model 
consists of Petri net marked graphs called the algorithm 
marked graph, the node marked graph, and the 
computational marked graph. Time performance measures 
of time between input and output (TBIO), task time
(TT), and time between outputs (TBO) were defined.
Then lower bounds for the performance measures were 
calculated analytically from the modified algorithm graph 
and the computational marked graph. An example to 
illustrate these concepts was presented. 

Simulation tools and an actual hardware prototype
have been developed to test and validate the ATAMM
model. The simulation software package [16] consists of a
PC-based computer model of the CMG. Algorithms are
entered to the package by specifying the algorithm marked
graph, and simulation output consists of a graphical
display of the movement of tokens. An accompanying
diagnostic software package [17] automatically computes
and displays performance measures and other performance
data. A hardware prototype [18] has also been constructed
to validate the ATAMM operating rules and to provide a
benchmark for testing the simulation software. The
architecture is shown in Figure 6 and is one of several
candidates which could be used to perform concurrent
operations according to the ATAMM rules. A primary
motivation for this particular design was the availability of
hardware. The system consists of S-100 crates having an
Intel 8088 CPU card, multiple serial I/O channels, and
32K memory. An IBM/XT is used to host the system and
to download algorithm graph descriptions to the system.
A number of decomposed algorithms, including those
presented here, have been investigated using these tools.

Continuing research is designed to generalize the
ATAMM model and is focused in three main areas. The
present model assumes that all functional units are
identical and that each is able to perform all primitive
operations. An important extension is to model the
situation where there are two or more different groupings
of processors where each group is able to perform only a
subset of the required primitive operations. The present
model represents only decision-free algorithms. Another
important extension is to develop the capability to admit
algorithms containing data-dependent branching points.
Finally, methods for achieving optimum time performance
are being studied in the context of the ATAMM model.